[10:27:58] 10Traffic, 10Operations, 10Page-Previews, 10RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3957874 (10phuedx) >>! In T184534#3954704, @BBlack wrote: > I think to really comprehend the right fix here, I'd need to rewind a little and figure o...
[12:53:41] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3958098 (10demon) https://gerrit.googlesource.com/plugins/motd/+/master could be useful on gerrit's...
[14:07:26] ema: I think we've hit 1w as of roughly now on upload-ulsfo?
[14:07:29] still no ramps!
[14:07:39] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=now-7d&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&var-server=All
[14:08:09] \o/ for now.
I think that's a pretty solid improvement indicator, although a second week will help solidify that view
[14:09:33] bblack: one week exactly, yes!
[14:42:03] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3958405 (10Jgreen)
[14:44:06] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3958412 (10Jgreen) We've done all hosts but civi1001, frdb1001, and frdb1001 which require fundraising downtime.
[14:44:15] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3958413 (10Jgreen) 05Open>03Resolved
[14:44:18] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3958414 (10Jgreen)
[16:24:48] the 30 days view of that graph is great
[17:07:56] <_joe_> bblack: we just had a big spike of mailbox lag on 4023 btw
[17:08:01] <_joe_> you jinxed it
[17:09:15] <_joe_> it's still a small peak, and seems related to an increased backend activity
[17:20:43] _joe_: related to wmf.20 attempts and such? or natural traffic?
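[Editor's note: for context on the "mailbox lag" metric being watched here, a hedged sketch. Varnish mailbox lag is conventionally the gap between the `MAIN.exp_mailed` and `MAIN.exp_received` counters (objects handed to the expiry thread vs. objects it has processed); these are real Varnish counters, but the live `varnishstat` invocation is only illustrative and assumes a running varnishd.]

```shell
#!/bin/sh
# Sketch: mailbox lag = objects mailed to the expiry thread minus objects
# the expiry thread has received/processed. A steadily growing difference
# is the "ramp" discussed in this log.
mailbox_lag() {
    mailed=$1
    received=$2
    echo $(( mailed - received ))
}

# Live use on a cache host would look roughly like this (requires varnishd):
#   mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
#   received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
#   mailbox_lag "$mailed" "$received"

# Illustrative values only:
mailbox_lag 500000 120000
```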
[17:21:37] <_joe_> no idea, maybe the former
[17:22:16] yeah I see 4023, still spiking up so far
[17:22:30] but the old ramps came up slower, and reached the multi-millions, this is something else
[17:23:56] err no, I was wrong about ramp slopes
[17:24:03] the slop does look very similar to the old ones
[17:24:07] *slope
[17:24:16] getting into the few-hundred-K range in ~15m
[17:24:31] if it's a ramp like the old ones, it will probably keep building for quite a while though and reach millions
[17:26:58] it seems like possibly a scan-attack sort of thing (which often are merely misguided rather than malicious)
[17:27:09] but it's odd it's hitting one server harder than the rest in that case
[17:28:01] maybe huge traffic influx on a particular media file too
[17:44:09] !log cp4023: experimental, "renice -19 39007" (backend cache-timeout aka expiry thread), to see if mbox lag resolves on its own quicker
[17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:37] !log cp4023: now seems to be leveling off on lag and decreasing objhdr locks. either expiry thread prio helped (which argues for our prio-related patches) or it was naturally going to end?
[18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:52] ema: whenever (monday?), take a peek at cp4023 around the time of these SALs
[18:08:24] !log cp4023: after a brief period of levelling off a bit: sharp, steep recovery of mbox lag ramp back to ~6K. not sure if this is a new floor or will drop further, but seems pretty ok.
[18:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
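[Editor's note: the `renice -19 39007` in the 17:44 SAL entry raised the scheduling priority of the expiry thread's task ID (39007 was that day's TID on cp4023) and requires root. A minimal unprivileged sketch of the same mechanics follows; the `sleep` process and the +10 nice value are stand-ins, not anything from the log. On Linux, one would typically find the expiry thread's TID among varnishd's tasks first, e.g. with `ps -L -p <varnishd-pid> -o tid=,comm=`.]

```shell
#!/bin/sh
# Demonstrate renice mechanics on a throwaway process (no root needed,
# since we only *lower* priority; -19 as in the log would need root).
sleep 60 &
pid=$!

# Old-style "renice <prio> <pid>" also works, as in the log's "renice -19 39007";
# the POSIX form is "renice -n <delta> -p <pid>".
renice -n 10 -p "$pid" >/dev/null

# Read back the nice value to confirm it took effect.
nice_val=$(ps -o ni= -p "$pid" | tr -d ' ')
echo "$nice_val"

kill "$pid"
```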