[10:39:54] ema, faidon, just a thought, what about getting a bag full of ripe atlas probes, and giving them out at Wikimania to community people who live in areas where there are few or no probes?
[10:54:16] ripe atlas probes are free of charge after evaluation, right?
[10:57:31] yep
[10:58:15] and the ripe is always interested in having probes in as many diverse locations as possible
[11:01:21] so we'd have to talk to them and see if they're interested
[11:01:33] they don't even do that at their own conferences though :)
[11:02:53] XioNoX: sounds like you want to become a ripe ambassador :P
[11:14:30] i remember ripe distributing atlas probes at some events but not sure which ones
[11:14:36] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3303337 (10BBlack) We can get broader averages by dividing the values seen in the aggregate client status code graphs u...
[11:15:33] it's just an idea, but they might be interested, and that could be useful for us, re the datacenter latency map
[11:38:20] XioNoX: https://atlas.ripe.net/get-involved/become-a-ripe-atlas-ambassador/
[11:41:49] if you guys think it's a good idea, I can investigate
[11:42:10] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955371 (10Gilles) Would the use of xkey help here? It sounds like a single user action currently generates several pur...
[12:19:17] 10netops, 06Operations: Init7 routing loop - https://phabricator.wikimedia.org/T166663#3303641 (10ayounsi)
[13:12:40] XioNoX: seems like a cool idea to me. Gets us more data too :)
[13:19:39] fwiw, I applied to RIPE for anchors for esams and singapore
[13:19:48] as soon as they get back to us I'm going to submit a procurement request
[13:21:38] nice
[13:27:14] ema: while you're here, we need to get back on the maps+upload thing soon-ish, as it's going to block making easy progress in ulsfo
[13:27:28] bblack: sure, I'm just finishing the 8.8 upgrades now
[13:27:34] ema: the other related bit is the LVS refactoring, but both touch LVSes, probably simpler to sort them out serially
[13:28:32] I looked at the LVS thing, and I think for the first phase ( T165765 ) it looks like there's no real refactoring necessary in puppet/pybal.
[13:28:33] T165765: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765
[13:28:41] just updating node lists
[13:29:31] by that I mean, at the top of modules/lvs/manifests/configuration.pp, I think we can add a host to multiple traffic classes
[13:36:36] bblack: mmh, what would happen to $lvs::configuration::lvs_grain_class then?
[13:38:11] yeah...
[13:38:16] but that part was just for salt really
[13:38:22] right
[13:38:42] we can make up new labels or just kill those, since we don't use salt anymore
[13:39:16] yeah I was trying to skim the puppet code to look for places where we assume 1 host <-> 1 class
[13:39:30] yeah there's really not any I don't think
[13:39:38] there will be some associated changes to router config too
[13:40:15] in this first phase, basically if you're looking at how the salt grains divide things... those two concepts (traffic-class and pri/sec) are now combined
[13:40:29] it's more like each host is either primary for traffic class X, or secondary for all.
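A minimal sketch of the "primary for one class, secondary for all" membership model discussed above. This is not the actual puppet/pybal code: the host names (lvsA/lvsB/lvsC) and class names are invented for illustration, and the real layout lives in modules/lvs/manifests/configuration.pp.

```python
# Hypothetical sketch of the shared-failover layout discussed above; host and
# class names are made up and do not reflect the real configuration.pp data.
# Each traffic class maps every LVS host to a role: exactly one "primary",
# the rest "secondary", so one host can appear in multiple classes.
lvs_classes = {
    "high-traffic1": {"lvsA": "primary",   "lvsB": "secondary", "lvsC": "secondary"},
    "high-traffic2": {"lvsA": "secondary", "lvsB": "primary",   "lvsC": "secondary"},
    "low-traffic":   {"lvsA": "secondary", "lvsB": "secondary", "lvsC": "primary"},
}

def check_one_primary_per_class(classes):
    """Verify the invariant: every traffic class has exactly one primary host."""
    for name, members in classes.items():
        primaries = [h for h, role in members.items() if role == "primary"]
        assert len(primaries) == 1, f"{name} has {len(primaries)} primaries"

def classes_for_host(classes, host):
    """A host now belongs to every class it is listed in, with a per-class role."""
    return {name: members[host] for name, members in classes.items() if host in members}

if __name__ == "__main__":
    check_one_primary_per_class(lvs_classes)
    print(classes_for_host(lvs_classes, "lvsA"))
    # -> {'high-traffic1': 'primary', 'high-traffic2': 'secondary', 'low-traffic': 'secondary'}
```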
[13:40:56] but later on the second phase idea (if we ever get there) is a bit more complicated
[13:41:14] (which is T165764 )
[13:41:17] T165764: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764
[13:48:38] bblack: should we start the procedure and merge https://gerrit.wikimedia.org/r/#/c/353054/ ?
[13:49:06] high-traffic2 LVSs should be: lvs[4002,4004].ulsfo.wmnet,lvs[2002,2005].codfw.wmnet,lvs[3002,3004].esams.wmnet,lvs[1008,1011].eqiad.wmnet,lvs[1002,1005].wikimedia.org
[13:49:18] pasting here just to double-check :)
[13:56:55] well, one last thought in all of the tangle of pending things, we never re-did the storage layout
[13:57:13] but the tweaks to that look fairly small and maps traffic is fairly low, etc
[13:58:07] we probably wouldn't go through a forced set of restarts for that anyways, just let the natural weekly restarts pick it up as it goes
[13:58:39] https://phabricator.wikimedia.org/T145661#3233742 was as far as I got on that. I didn't finish translating that into practical percentages to stuff in puppet I think.
[13:58:49] (or further splitting the bins, if we want to try that. maybe later?)
[14:00:00] let me see if I can at least reduce the new data to a set of realistic changes for puppet right quick
[14:00:05] ok
[14:09:57] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3304018 (10BBlack) This is what we had before (copied from far above): | Bin | SizeRange | StoragePct | StorageSize (1 node) | Disk | --- | --- | --- | --- | B...
[14:13:09] ema: https://gerrit.wikimedia.org/r/#/c/356394/
[14:13:45] so maybe push that through first, so the upcoming week they restart into a better balance with maps considered. But honestly I don't expect it to be critical enough to hurt anything (more than we already are in general with occasional mailbox lag alerts)
[14:13:58] bblack: looking
[14:14:43] there's some arbitrariness/magic in mapping the calculated percentages to the final ones. It comes down to trying to make something make sense given the additional constraints that:
[14:14:58] 1) there's two disks to alternate and each one's set of bins has to add up to exactly 50% of total storage
[14:15:24] 2) the final bin is always undersized and can use more growth, because single objects can be such a large fraction of it
[14:15:27] etc...
[14:17:55] oh and post-merge, should puppetize and restart one random upload node just to be sure there wasn't some bad rounding effect that causes it to run out of space
[14:18:16] the floor() in there is meant to prevent that, but then there's always little quibbles about filesystem allocation and whatnot, and we're right near the edge
[14:19:18] bblack: maybe one per DC just to be sure? (different hdd sizes)
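A hypothetical illustration of the bin-sizing constraints described above. The total size, bin names, percentages and disk assignment below are made up (they are not the values from T145661 or the gerrit change); the point is only the shape of the calculation: bins alternate between two disks, each disk's bins must add up to exactly 50% of total storage, and sizes are floored so rounding can never overcommit a disk.

```python
# Made-up example of the storage-bin arithmetic discussed above; values are
# illustrative only, not the real cache_upload percentages.
from math import floor

TOTAL_GB = 720  # assumed usable storage per node, for illustration only

# (bin name, percentage of total storage, disk index)
bins = [
    ("bin0", 12, 0),
    ("bin1", 38, 1),
    ("bin2", 38, 0),
    ("bin3", 12, 1),  # final bin: kept roomy because single objects are a big share of it
]

def bin_sizes(total_gb, bins):
    """floor() the sizes so rounding can never push a disk past its 50% share."""
    return {name: floor(total_gb * pct / 100) for name, pct, _ in bins}

def check_per_disk_split(bins):
    """Constraint 1: each disk's set of bins must total exactly 50% of storage."""
    for disk in (0, 1):
        pct = sum(p for _, p, d in bins if d == disk)
        assert pct == 50, f"disk {disk} bins add up to {pct}%, expected 50%"

if __name__ == "__main__":
    check_per_disk_split(bins)
    print(bin_sizes(TOTAL_GB, bins))
```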
[14:21:08] upload's disks are all identical presently
[14:21:18] (until ulsfo's bigger new disks come online, anyways)
[14:21:51] true, just checked :)
[14:21:54] ok merging
[14:22:00] they're all models "C" and "D" in: https://wikitech.wikimedia.org/wiki/Traffic_cache_hardware
[14:22:27] (only maps and misc have disks other than the intel S3700 @ 400G)
[14:26:55] tested on cp2026, all good
[14:28:31] bblack: short break and then I'll proceed with https://gerrit.wikimedia.org/r/#/c/353054/ unless you have objections in the meanwhile :)
[14:37:54] I'm good if you're good, but I'm leaving for a dr appt in a few mins :)
[14:38:01] 10netops, 06Operations: Init7 routing loop - https://phabricator.wikimedia.org/T166663#3304151 (10ayounsi) 05Open>03Resolved BGP re-enabled, confirmed working.
[14:38:22] it's slightly-scary, but we've done this before when collapsing past clusters (e.g. mobile and bits into text ages ago)
[14:38:49] as long as cache_uploads have the new IPs, and you restart secondaries first then primaries on LVS and confirm that the ipvsadm output looks sane along the way, etc... should be fine
[14:39:03] (and then later we can change DNS and eventually remove the old maps IP)
[14:39:42] if you want to wait until later or tomorrow when I'm around, that's fine too, judgement call :)
[14:40:32] (backup would almost certainly be prudent if this were text or upload moving, but it's "just" maps, so...)
[14:54:42] bblack, ema: I remember that some sysctl values for LVS are configured via modules-load.d to mitigate bootup races. That turned out to not be a complete fix for the nf_conntrack case: https://phabricator.wikimedia.org/T136094#3298506
[14:54:52] it might be related to specific kernel modules
[14:55:34] kmod probably just calls the init() of the kernel module, it might be that some modules like nf_conntrack simply take longer to initialise
[14:55:51] so might be worth double-checking whether that's currently set reliably for LVS
[14:57:45] from what I can tell all sysctl settings should be idempotent, so maybe the best fix is to re-run systemd-sysctl in a separate service when multi-user.target has been reached, I'll test this tomorrow
[15:03:36] moritzm: currently all load balancers have the proper values set
[15:03:46] and I think they did get the settings right after reboot
[15:03:58] but yeah perhaps it's module-dependent
[15:04:21] ok, it might also be a matter of what the module actually does, e.g. nf_conntrack also loads dependent modules for conntracking other protocols etc.
[15:05:16] at any rate, sysctl.d(5) explicitly mentions the modules-load.d workaround
[15:05:47] so if it doesn't work as advertised it's probably something worth reporting upstream :)
[15:07:11] yeah, I asked in #debian-systemd whether that was a known bug, but they were unsure
[15:07:24] I think I'll report it upstream, if only to let them fix the docs
[15:07:29] right
[15:39:27] ok, IP move procedure started. Now checking that upload nodes get the right maps addresses
[15:42:35] checked on one upload node per cluster, they look good. Forcing puppet run on cache_upload
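The per-cluster spot-check mentioned just above was presumably done by hand with `ip addr`; the following is only a rough sketch of that kind of check, with placeholder addresses (not the real maps/upload service IPs).

```python
# Rough, hypothetical sketch of the post-puppet spot-check described above:
# confirm that the expected service IPs are configured on a cache node.
# The addresses below are placeholders, not real Wikimedia service IPs.
import subprocess

EXPECTED_IPS = {"203.0.113.10", "2001:db8::10"}  # placeholder service IPs

def configured_ips():
    """Collect all addresses currently configured on any interface."""
    out = subprocess.run(["ip", "-o", "addr", "show"],
                         capture_output=True, text=True, check=True).stdout
    ips = set()
    for line in out.splitlines():
        fields = line.split()
        # one-line format: "<idx>: <iface> inet|inet6 <addr>/<prefix> ..."
        if len(fields) > 3 and fields[2] in ("inet", "inet6"):
            ips.add(fields[3].split("/")[0])
    return ips

if __name__ == "__main__":
    missing = EXPECTED_IPS - configured_ips()
    print("all expected IPs present" if not missing else f"missing: {missing}")
```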
[15:53:19] pybal.conf diff on lvs1011 looks sane after puppet agent run, restarting pybal there
[15:54:05] 10Traffic, 10DBA, 06Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (10jcrespo)
[15:56:15] ipvsadm also seems fine there, maps IPs replaced with upload ones
[15:56:33] 10Traffic, 10DBA, 06Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304486 (10jcrespo) {P5513}
[16:02:00] lvs1011/lvs1008 done, carrying on with the real load balancers starting with ulsfo
[16:04:07] 10Traffic, 10DBA, 06Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304501 (10jcrespo)
[16:05:10] 10Traffic, 10DBA, 06Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (10jcrespo) Please note that **the ticket is public, but the list of ips is not**, do not take private data from the ips list and copy it here.
[16:08:03] cp4004 looks sane, proceeding with cp4002 (hence actual traffic now)
[16:10:16] maps traffic flowing fine on cp4006
[16:12:18] heh I've been logging ipvsadm's output under /tmp before and after pybal restart but on the primaries the activeconn column ruins the diff :)
[16:12:23] everything ok so far
[16:28:29] ulsfo and codfw done
[16:33:06] oh, goodbye to the maps-specific dashboards of course
[16:39:05] esams done
[16:48:27] all done
[16:51:53] there's been a small 503 spike in upload due to cp1074, mbox lagging since > 2h
[16:54:50] varnish-backend restarted, it had only been running for 3d
[17:05:52] so the situation LGTM, I'm gonna leave puppet disabled on the cache_maps hosts Just In Case
[17:06:19] (confirmed that none of them is receiving any requests except for the various checks)
[17:53:05] ema: \o/ nice work
[17:57:29] re: the rest of the maps-decom stuff, my current thinking is first just wait ~24h in the current state (for ease of revert), and re-evaluate from there. If things aren't looking awful (e.g. rate of mailbox lag failures goes way up), then maybe tomorrow we swap the DNS and start planning out the remainder of the cleanup.
[19:15:42] 10Traffic, 10DBA, 06Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3305582 (10jcrespo) This started at exactly 3am and ended at exactly 18pm. All very weird.
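Regarding the 16:12 remark above about the ActiveConn column ruining the before/after ipvsadm diff: one possible, untested workaround is to strip the volatile connection counters from the snapshot before diffing. The sketch below assumes the default `ipvsadm -Ln` column layout (real-server lines ending in Weight, ActiveConn, InActConn) and is not what was actually run on the LVSes.

```python
# Untested sketch: take an `ipvsadm -Ln` snapshot with the volatile
# ActiveConn/InActConn columns stripped, so before/after diffs only show
# changes to virtual services, real servers, schedulers and weights.
import subprocess
import sys

def ipvsadm_snapshot():
    out = subprocess.run(["ipvsadm", "-Ln"],
                         capture_output=True, text=True, check=True).stdout
    lines = []
    for line in out.splitlines():
        if line.startswith("  ->"):
            # real-server lines: drop the last two (connection-counter) columns
            lines.append(" ".join(line.split()[:-2]))
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # e.g. run once into /tmp/ipvs.before, restart pybal, run again, then diff
    sys.stdout.write(ipvsadm_snapshot())
```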