[07:30:44] ema!
[07:33:12] <_joe_> kill -HUP ema
[07:33:14] <_joe_> :D
[07:33:44] :)
[07:33:49] jijiki!
[07:34:18] one does not simply kill -HUP ema
[07:34:51] <_joe_> well it worked, didn't it
[07:35:00] <_joe_> next time I'll strace him
[07:35:06] <_joe_> to verify if he's sleeping
[07:35:09] how about my privacy hey
[07:35:25] <_joe_> "privacy"
[07:35:29] <_joe_> you're so 2000s
[07:35:44] <_joe_> anyways, sorry, jijiki had important business to discuss :P
[07:36:22] ema: we have separate caches for php7 and hhvm
[07:36:56] in modules/varnish/templates/text-frontend.inc.vcl.erb
[07:37:36] our goal is to use php7 for the API as well; right now we do have some API requests served via php7
[07:37:41] getting straced at 9 AM is kinda aggressive /o\
[07:38:17] but only when clients already have the PHP_ENGINE=php7 cookie
[07:38:35] we had an idea of having API servers serve only via php-fpm
[07:39:13] but we have the issue that those clients will not hit the right caches
[07:39:32] since the cookie will not be set at all
[07:40:06] can we think of a workaround for this case?
[07:41:34] <_joe_> the problem is we can't use Vary or request headers
[07:42:15] <_joe_> the only thing we could use is a resp header
[08:03:13] how about setting X-Seven when req.url indicates we're dealing with an API request?
[08:05:52] we will have servers serving php-fpm only, one by one
[08:06:18] but we can have apache set it when we hit a php-fpm-only API server
[08:09:36] _joe_ what do you think?
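[editor's note] A minimal sketch of jijiki's proposal above: derive a request-side signal (the X-Seven header) from the URL, so API requests can be told apart before any cache lookup. The `/w/api.php` path and the helper name are assumptions for illustration, not taken from the log.

```python
# Hypothetical sketch: tag API requests with an X-Seven request header
# based on the URL alone, since php-fpm-only API servers would make the
# URL a sufficient signal. Path and header name are assumed, not quoted.

def annotate(req_url, req_headers):
    if req_url.startswith("/w/api.php"):
        req_headers["X-Seven"] = "1"
    return req_headers

# Usage: an API request gains the header, a regular page view does not.
api_headers = annotate("/w/api.php?action=query", {})
page_headers = annotate("/wiki/Foo", {})
```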
[08:09:58] <_joe_> so
[08:10:12] <_joe_> we can set X-Seven *and* the cookie at the varnish layer
[08:10:16] <_joe_> we wanted to avoid that
[08:10:24] <_joe_> but we can go that route, yes
[08:10:41] <_joe_> it means doing a chance extraction in VCL
[08:11:11] <_joe_> jijiki: the issue is that Vary: can only depend on request headers, not response headers
[08:11:21] <_joe_> so I don't think there is an easy solution to the problem
[08:13:06] <_joe_> I think in the end we're better off just not trying to solve this problem :D
[08:13:22] that sounds even better, yes! :)
[08:13:26] <_joe_> but if anyone is willing to write the vcl logic, I'm happy to help
[08:41:27] ema: different idea
[08:42:23] what if we set the cookie for full wikis
[08:42:37] on the varnish layer though
[09:12:11] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5155909, @Ottomata wrote: > I don't think Magnus would build it into librdkafk...
[10:05:31] <_joe_> ema: I'd like to deploy my pybal patch to proxyfetch ASAP
[10:06:33] _joe_: ok!
[10:06:43] <_joe_> as in this week
[10:06:54] <_joe_> can you take a look at my patch today? :)
[10:07:37] that implies releasing a new pybal version
[10:07:42] that's going to be fun
[10:07:45] <_joe_> I'm aware
[10:07:47] <_joe_> why?
[10:07:54] ~1 year since the last one
[10:08:02] <_joe_> can't we just backport this to whatever release branch we have?
[10:08:11] <_joe_> we do have a release branch, right?
[10:08:16] indeed
[10:08:25] <_joe_> and yes, you need to do a proper release sooner or later
[10:08:28] I'll try to cherry-pick your commit and that's it
[10:08:39] maybe the k8s support as well
[10:08:45] <_joe_> yeah
[10:08:55] <_joe_> not sure we'll use it rn
[10:09:00] <_joe_> but maybe for new services
[10:09:20] so you just messed with me at 2018's hackathon?
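[editor's note] The Vary limitation _joe_ explains above (08:11:11) can be sketched in miniature: an HTTP cache's secondary key is built from the *request's* values for the headers named in the response's Vary, so a backend-chosen engine that never shows up on the request side (no PHP_ENGINE cookie set) cannot partition the cache. This is an illustrative toy model, not WMF's actual VCL; the backend routing rule is assumed.

```python
# Toy model of Vary-based cache partitioning. Two cookie-less clients
# collapse onto one cache entry even if the origin would have routed
# one of them to a php-fpm-only server: the cache only sees the request.

class MiniCache:
    def __init__(self):
        self.store = {}

    def key(self, url, req_headers, vary):
        # Secondary key: the request's values for each header in Vary.
        return (url,) + tuple(req_headers.get(h, "") for h in vary)

    def fetch(self, url, req_headers, backend):
        vary = ("Cookie",)  # assume a prior response declared Vary: Cookie
        k = self.key(url, req_headers, vary)
        if k not in self.store:
            self.store[k] = backend(url, req_headers)
        return self.store[k]

def backend(url, req_headers):
    # Hypothetical origin: php7 only for clients carrying the cookie.
    cookie = req_headers.get("Cookie", "")
    engine = "php7" if "PHP_ENGINE=php7" in cookie else "hhvm"
    return f"rendered by {engine}"

cache = MiniCache()
a = cache.fetch("/api", {"Cookie": "PHP_ENGINE=php7"}, backend)  # php7 entry
b = cache.fetch("/api", {}, backend)                             # hhvm entry
```

A second cookie-less fetch returns the cached hhvm object regardless of which origin would actually serve it, which is exactly why a response header alone can't fix the split.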
[10:09:24] * vgutierrez cries in the corner
[10:09:29] <_joe_> I said right now
[10:09:31] <_joe_> not ever
[10:09:33] ;P
[10:09:41] <_joe_> you can complain to your lunch mate about that
[10:09:47] will do
[10:09:49] }:)
[13:16:21] fixed a bug due to which we were only accounting for varnish-be and not ats-be on dashboards such as varnish-caching
[13:16:33] data looks much better now :)
[13:16:34] https://grafana.wikimedia.org/d/000000500/varnish-caching?refresh=15m&orgId=1&var-cluster=cache_upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[13:20:21] 10netops, 10Cloud-Services, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10chasemp) I was 6 months off on my estimate for this :)
[13:53:23] ema: nice catch :D
[14:04:30] cp3038 (upload) has been entering/exiting sick state for a few minutes now. Restarting varnish-be
[14:10:05] OCSP stapling improvements in httpd coming - https://github.com/icing/mod_md/wiki/V2Design
[14:23:33] interesting
[14:26:00] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4026.ulsfo.wmnet'] ` The log can be...
[14:54:30] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) @bblack circling back on this, do you still see any issue now after the Silverpop SSL improvements?
[15:08:29] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` and were **ALL** successful.
[15:08:33] \o/
[15:09:27] nice
[15:15:02] now all upload traffic in ulsfo is served by ATS :)
[15:15:29] \o/
[15:16:16] if anyone can help investigate potential causes for cp1083's crash that would be great
[15:16:20] https://phabricator.wikimedia.org/T222620
[15:16:28] the host is still depooled
[15:16:44] nothing interesting in kern.log.1
[15:16:57] it just committed seppuku
[15:26:09] was there anything in racadm getsel? I see the log cleared at 13:30:25
[15:26:54] ah, that I didn't check
[15:27:43] * volans wonders what cleared them then
[15:28:24] I've power-cycled with racadm serveraction powercycle
[15:28:31] and that's it
[15:31:18] from racadm getraclog I just see your login at 15:42:11 and subsequent reboot messages
[15:33:00] mmmh, that's actually in the future :D
[15:34:02] lol
[15:34:30] I wish I could travel to the future, just one minute after meetings
[15:39:41] lol
[16:11:27] https://news.ycombinator.com/item?id=19828702
[16:11:47] "The archive.is owner has explained that he returns bad results to us because we don't pass along the EDNS subnet information."
[16:14:32] yeah, insane...
[16:15:24] I could understand sending back a generic non-geolocated and thus badly performing IP, but breaking one's own site for a percentage of users?
[16:15:30] the logic fully eludes me
[16:20:31] yeah
[16:20:59] 1.1.1.1 + edns-client-subnet is something I thought about a lot back when 1.1.1.1 was all in the news.
[16:22:05] at first I was annoyed and worried about it, and didn't think it made sense from the pov of an operator like us (since the user IPs leaking through ECS to our authservers also eventually connect to us directly, so there's no privacy gain and there's geolocation loss).
[16:22:43] but (a) it does make sense from the pov that many (perhaps most by popularity now?) sites' authdns aren't self-hosted, and ECS thus centralizes that leakage in various 3rd-party dns hosts
[16:23:20] + (b) cloudflare's argument about their expansive edge network mitigating the effects for a situation like WMF's sounds completely reasonable.
[16:26:27] but on the other other hand, I think eventually when we get the time to address it, we'll probably start working on better-than-geolocation solutions, and the lack of ECS will start mattering at least a little more then. But hopefully (b) still makes it reasonable-ish.
[16:27:20] ("better" meaning things like actually 1/1K sampling ping data from our own random users' clients to all our edges to build a dynamic database of real per-network latencies for those networks we can get data on, as a more-accurate overlay of the basic geographic stuff)
[16:28:25] (where "ping" of course is not ICMP, but minimal beacon https reqs coming out of JS/workers)
[16:57:06] I guess it's a move to try to force DNS providers to implement EDNS, good luck though...
[17:28:00] 10Traffic, 10Operations: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (10Vgutierrez)
[17:28:56] 10Traffic, 10Operations: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (10Vgutierrez) p:05Triage→03Normal
[21:23:32] 10Domains, 10Traffic, 10Operations, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) 05Open→03Stalled
[22:17:27] 10netops, 10Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) > After checking the core the engineering team has an update on what happened > “The thread that is holding the lock seem to have corrupted stack and is holding the lock for a very long time. Other t...
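[editor's note] The sampled-latency idea bblack describes at 16:27:20 can be sketched roughly: record beacon-measured RTTs from client networks to each edge, and prefer the measured-best edge over the geoip guess once enough samples exist. All names, networks, thresholds, and RTT numbers below are made up for illustration.

```python
# Rough sketch (assumed design, not WMF code) of a per-network latency
# overlay on top of geolocation: pick the edge with the lowest median
# measured RTT, falling back to the geoip default when samples are thin.

from collections import defaultdict
from statistics import median

class EdgePicker:
    def __init__(self, geo_default, min_samples=5):
        self.geo_default = geo_default  # network -> edge, from geoip
        self.min_samples = min_samples
        # network -> edge -> list of RTTs, fed by ~1/1K sampled beacons
        self.samples = defaultdict(lambda: defaultdict(list))

    def record(self, network, edge, rtt_ms):
        self.samples[network][edge].append(rtt_ms)

    def pick(self, network):
        measured = {
            edge: median(rtts)
            for edge, rtts in self.samples[network].items()
            if len(rtts) >= self.min_samples
        }
        if measured:
            return min(measured, key=measured.get)
        return self.geo_default.get(network, "eqiad")  # assumed fallback

picker = EdgePicker(geo_default={"198.51.100.0/24": "ulsfo"})
# Beacons reveal this network actually reaches codfw faster than ulsfo:
for rtt in (80, 85, 78, 90, 82):
    picker.record("198.51.100.0/24", "ulsfo", rtt)
for rtt in (40, 42, 39, 45, 41):
    picker.record("198.51.100.0/24", "codfw", rtt)
```

With data, the measured overlay overrides the geographic default; without data (e.g. the long tail of networks), behavior is unchanged, which is what makes such an overlay deployable incrementally.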
[22:17:53] making amazing progress on that router rebooting on its own https://phabricator.wikimedia.org/T221156#5162750
[22:18:04] hi, https://phabricator.wikimedia.org/T222418 looks like a possible continuation of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190503-varnish
[22:18:12] cp1089 backend fetches
[22:18:55] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10Dzahn) p:05Triage→03High