[04:47:10] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3659506 (10Dzahn) We have the repo, and i see your change Amir, thank you! Before i merge the content itself i started with puppet code to in...
[06:23:29] hello traffic team, 3 backend fetch failures this morning for esams: cp3040, cp3030 and cp3032. The first two got their varnish backend restarted, the last one not yet, because the 503s are quiet and I'd like to check in with you before knocking down the third text cache in a row :)
[06:23:46] elukey: hey!
[06:23:59] <3
[06:24:37] timeline ~5:18 - ~6:02 UTC
[06:24:44] hello ema ! <3
[06:25:25] so cp3032 hasn't been restarted and the 503s stopped on their own?
[06:35:41] yep exactly
[06:36:15] cp3040 was the first IIRC, that self recovered, then I restarted cp3030 while it was throwing 503s
[06:36:25] and finally cp3032, that auto-recovered
[06:36:48] recently the pattern seems to be two esams backends self-recovering
[06:37:06] this morning was different :)
[06:37:22] it is really weird though, it only happens afaics in the super early EU morning
[06:37:29] the rest of the day looks fine
[06:37:53] maybe it is bot related or time related? (1d caches expiring or similar?)
[06:41:44] so yeah we've been discussing this a lot
[06:42:54] it's most likely not related to expirations, as there's no way they all happen at the same time (grace/keep are fixed, but ttls/time of cache entry are all different)
[06:43:32] a bot or some other type of regular morning pattern from certain clients seems to be the most likely cause
[06:44:04] the rate of objects entering the cache rises fast before the issue shows up IIRC
[06:47:27] eg look at cp3030:
[06:47:39] backend objects made: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=45&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams%20prometheus%2Fops&from=now-3h&to=now
[06:47:44] fetch failed:
[06:47:48] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams%20prometheus%2Fops&from=now-3h&to=now
[06:48:47] there's an initial increase in cached objects ~5:25, then it goes all the way up at 5:31
[06:49:03] fetches start failing later on at 5:37
[06:50:03] so the theory here would be: some client(s) show up at the same time in the EU morning with a request pattern that causes lots of misses (objects to be fetched into the cache at the same time)
[06:50:28] and that leads to the infamous series of circumstances causing fetch failures
[06:52:18] as a way to figure this out I've tried to look at pivot data for the hosts affected shortly before the issue arises, mostly UA, IPs, req_uri, ...
[06:52:51] list them and see if there's a subset of those that match with other days
[06:53:20] nothing really stood out though, but surely the analytics team would know better how to do this :)
[07:45:14] I can definitely help if you want, maybe we can check via beeline the requests from 5:25 -> 5:31
[07:45:26] (for cp3030 only)
[07:45:46] it can be done in pivot too but with 1/128 of the data
[07:46:21] yeah so the optimal thing would be: query prometheus to get the hostname and time/day of failed fetches for the last, what, 20 days?
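A minimal sketch of that Prometheus step might look like the following; the base URL, the metric name, and the label names are placeholders/assumptions here, not the exact production values:

```python
#!/usr/bin/env python3
"""Pull fetch-failure spikes per cache host out of Prometheus for the last
20 days. Base URL, metric name and labels are assumptions/placeholders."""
import datetime

import requests

PROM = "http://prometheus.example.wmnet/ops"  # placeholder base URL
# Assumed exporter metric for varnishstat's MAIN.fetch_failed counter:
QUERY = 'rate(varnish_main_fetch_failed{site="esams", layer="backend"}[5m])'
THRESHOLD = 1.0  # failed fetches/sec we consider "a spike"; tune as needed

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(days=20)

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60s",
    },
    timeout=30,
)
resp.raise_for_status()

# Each result series carries its labels (host) plus [timestamp, value] samples;
# print host and UTC time for every sample above the threshold.
for series in resp.json()["data"]["result"]:
    host = series["metric"].get("instance", "unknown")
    for ts, value in series["values"]:
        if float(value) > THRESHOLD:
            when = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
            print(f"{host}\t{when.isoformat()}")
```

The printed host/timestamp pairs would then feed the correlation query described next.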
[07:47:11] then come up with some magic query that highlights the subset of IPs/UAs/URIs common to most of those a few minutes before fetch failures start
[07:48:07] or, as a start, the client causing most misses would probably be fine too :)
[10:40:42] 10Traffic, 10Operations, 10Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3660161 (10ema) varnish-modules backported and uploaded to experimental. In my comment above I forgot to mention libvmod-vslp, which conveniently we do not need to care about any longer as i...
[10:45:18] 10Traffic, 10Operations, 10Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3660177 (10ema) Oh, and `s/return (fetch)/return (miss)/`.
[14:50:35] basic v5 support: https://gerrit.wikimedia.org/r/382464
[14:50:49] hiera calls all in the wrong place, please yell at me
[14:51:41] https://varnish-cache.org/docs/5.1/reference/vmod_directors.generated.html#obj-shard
[14:52:07] ^ there's at minimum a .reconfigure() call after all the backends are added. there's a bunch of other ways to tune it, too, it's a lot more complex than vslp
[15:01:29] yup, that's gonna require a lot of experimentation I guess
[15:18:41] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-eqiad: Interface error on fasw-c-eqiad:vcp-255/1/0 - https://phabricator.wikimedia.org/T177333#3661112 (10ayounsi) 05Open>03Resolved a:03ayounsi Chris replaced c1a-0 <-> c1b-1; interface errors are gone.
[16:21:34] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: vcp port down on fasw-c-codfw - https://phabricator.wikimedia.org/T177332#3661454 (10ayounsi) 05Open>03Resolved a:03ayounsi Papaul reseated the cable, no more issues.
[16:27:52] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3661490 (10ayounsi) 05Open>03Resolved Test in codfw was successful, no packet loss/issue.
[16:29:42] ema, bblack what's the next action item for me on Varnish5 ?
[16:30:35] XioNoX: merge all the libvmod-netmapper / varnish4 outstanding CRs, packages look good!
[16:31:59] oh yeah!
[16:33:45] ema: another thing to consider: esams has higher rates of 429 responses in general. all the DCs show occasional spikes (we'd expect that from occasional "abusive" api/scraper stuff), but esams spikes go much higher, and they happen all the time with some kind of pattern
[16:34:10] they may not be the problem itself, but could be interacting with the problem, or a leading/trailing indicator of heavy traffic that induces a related problem, etc
[16:34:14] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-7d&to=now&var-site=esams&var-cache_type=text&var-status_type=4
[16:34:47] the first day of those big esams 429 spikes was around Sept 27
[16:35:21] ema, why can't I merge https://gerrit.wikimedia.org/r/#/c/382174 ?
[16:35:44] actually any of the patches
[16:35:52] rights issue?
[16:35:52] you probably need to click the [X] next to the jenkins -1
[16:36:03] yay!
[16:36:34] when you get to netmapper, be sure to do the master-branch ones first (it should make those patches unnecc to merge in the debian branch)
[16:37:16] (in general, with our stuff in gerrit, merge things in the base-most branches first)
[16:39:23] everything merged
[16:39:50] hum, not sure in which order I did it :/
[16:40:14] oh yeah, I merged master before debian :)
[16:40:48] \o/
[16:40:57] yeeee
[16:49:19] bblack: oh, interesting!
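Circling back to the morning plan, a rough sketch of the "magic query" over the webrequest data; the table name, column names, partition layout, and the example date/window are assumptions rather than verified production schema:

```python
#!/usr/bin/env python3
"""Generate a HiveQL query ranking client signatures (IP / UA / URI path) by
request volume in the minutes before a fetch-failure window, to paste into
beeline. Table/column names and partition layout are assumptions about the
webrequest schema; the date and window are illustrative."""

TABLE = "wmf.webrequest"               # assumed table name
CACHE_HOST = "cp3030"                  # cache host whose window we inspect
YEAR, MONTH, DAY, HOUR = 2017, 10, 6, 5  # illustrative partition
WINDOW_START = "2017-10-06T05:25:00"   # illustrative: shortly before failures
WINDOW_END = "2017-10-06T05:31:00"

query = f"""
SELECT ip, user_agent, uri_path, COUNT(*) AS reqs
FROM {TABLE}
WHERE webrequest_source = 'text'
  AND year = {YEAR} AND month = {MONTH} AND day = {DAY} AND hour = {HOUR}
  AND hostname LIKE '{CACHE_HOST}%'
  AND dt BETWEEN '{WINDOW_START}' AND '{WINDOW_END}'
GROUP BY ip, user_agent, uri_path
ORDER BY reqs DESC
LIMIT 50;
"""

print(query)  # paste into beeline, or adapt for pivot's 1/128 sample
```

The same query with an http_status filter of 429 and the spike timestamps above could also help narrow down what is driving the esams 429 spikes.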
[16:50:56] I think the first 503 spikes in text esams happened earlier than the 27th, but still very interesting
[16:51:33] on the 20th, for example
[16:51:59] right, but when did we fix the more-general problems (the keep revert)?
[16:52:06] the earlier esams text 503s might have been related to that
[16:52:10] true
[16:53:08] oh the revert was the 18th
[16:53:18] + a few days of chaos after the revert before it was all clean on that front again
[16:53:25] and the backend storage pattern stabilization on the 19th
[16:54:38] let me set max_connections to 0 on cp1008 and see if that does what we think :)
[16:56:46] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&var-cluster=text&var-site=esams&from=now-30d&to=now
[16:57:08] ^ looking at esams hitrate stuff there, it's about what we'd expect (fewer remote hits in eqiad after the storage-pattern stuff)
[16:57:36] but it's interesting to see how big those spikes downwards of "Int" are in the Disposition graph (which are almost surely the same spikes as the 429s, which are Int)
[16:57:51] those 429 spikes are large fractions of all requests during their brief spike
[16:58:43] for example, the 429 rate spike at 2017-09-27T19:40 peaks at 10.44K/sec
[16:59:27] out of ~47K/sec total reqs at that mark
[17:00:09] the 429s might be completely unrelated of course, but they're still interesting. something is hitting them and treating them as "retry this immediately" I think :P
[17:00:36] at 1/4 of all traffic, the 1/1000 logs probably have evidence of the 429 cause too
[17:00:43] (on those spike marks)
[17:04:50] pivot is interesting too
[17:05:08] http://bit.ly/2kp4Bzb
[17:05:08] http://bit.ly/2kp4Bzb
[17:05:11] bleah
[17:06:33] mostly enwiki, uri path /
[17:07:38] how does that get rate limited though? Surely / is cached?
[20:25:09] 10netops, 10Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3662281 (10ayounsi) Added the key pair generated for ARIN to the pw repository. Generated a SOA for the v6 ARIN prefix, if no issues after propagation, I'll generate the last two ARIN v4 SOAs....
[20:55:18] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3662372 (10Dispenser) Since this seems to be impossible. Would adding an interstitial for Facebook referrers be doable? I imagine that we'...
[23:51:48] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3663063 (10Dzahn) Next we should figure out: - who should have Gerrit permissions for +2/merge on the content repo - give them the permissi...
[23:53:33] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3663069 (10Dzahn) By the way i couldn't just merge Ladsgroup's change in the new content repo as i just have +1 rights there, but not +2.
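As a footnote to the max_connections=0 test on cp1008 mentioned at 16:54: one client-side way to see what that setting actually does is to fire a small burst of concurrent requests and tally the response codes. The target URL below is a placeholder, and whether 0 means "no backend connections allowed" or "unlimited" is exactly the question the test is meant to answer:

```python
#!/usr/bin/env python3
"""Client-side check for the max_connections=0 experiment: send a burst of
concurrent requests and count the response codes. The target URL is a
placeholder for the test host, not the real setup."""
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

# TARGET should be a URL that misses/passes at the edge so fetches hit the backend.
TARGET = "https://pinkunicorn.example.org/wiki/Main_Page"  # placeholder URL
BURST = 50  # number of concurrent requests

def fetch(_):
    try:
        r = requests.get(TARGET, timeout=10)
        return r.status_code
    except requests.RequestException as exc:
        return type(exc).__name__

with ThreadPoolExecutor(max_workers=BURST) as pool:
    results = Counter(pool.map(fetch, range(BURST)))

# If max_connections=0 really means "no backend connections", this should be
# dominated by 503s; if it means "unlimited", mostly 200s.
for status, count in results.most_common():
    print(f"{status}: {count}")
```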