[08:27:49] Emperor: It's time to migrate apus to IPIP + maglev, the same thing we did with swift. When you have the chance please take a look at https://gerrit.wikimedia.org/r/q/topic:%22T387290%22 thx <3
[08:41:33] OK
[08:44:22] I've +1'd the codfw change, but I'm less sure about the eqiad one
[08:48:26] Emperor: fixed the eqiad one, thanks for the review
[08:55:21] nice :D this will be the first non-dry-run execution of the cookbook 🍿
[08:57:04] * volans hides
[08:57:43] "cookbook" pings you, volans?
[08:58:10] I'm everywhere :D
[08:58:22] volans: so you actually bind against 0.0.0.0
[08:58:27] indeed
[08:58:29] :D
[08:58:55] quick, switch to IPv6 ;p
[08:59:05] so far so good, the cookbook is running puppet on moss-fe[2001-2002]
[09:00:39] done, now it's time for lvs[2013-2014]
[09:01:00] PROBLEM - MariaDB sustained replica lag on s8 on db1193 is CRITICAL: 199 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1193&var-port=9104
[09:03:21] volans: hmmm, not a bug... but the cookbook is too silent when validating IPIP traffic
[09:03:54] some more logging would be helpful?
[09:03:58] yep
[09:04:02] easy fix :)
[09:04:03] it doesn't print anything at the moment
[09:04:15] that's because you're shy
[09:04:41] restarting pybal now :D
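A minimal sketch of the kind of check being discussed, assuming a realserver where the tunnel interface is called ipip0 (the same signal quoted later for ipip0@thanos-fe2001). This is not the actual cookbook code and the function names are made up; it just shows how the validation step could report what it sees instead of staying silent, by watching the kernel's RX counter for the tunnel.

```python
# Illustrative only (not the real cookbook): confirm IPIP traffic is
# arriving by watching the tunnel interface's RX counter, and log it.
import logging
import time
from pathlib import Path

logger = logging.getLogger("ipip-check")

def rx_packets(iface: str = "ipip0") -> int:
    """Read the kernel's RX packet counter for the given interface."""
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_packets").read_text())

def validate_ipip_traffic(iface: str = "ipip0", wait: float = 10.0) -> bool:
    """Return True if the tunnel received packets during the wait window."""
    before = rx_packets(iface)
    logger.info("%s RX packets before check: %d", iface, before)
    time.sleep(wait)
    after = rx_packets(iface)
    logger.info("%s RX packets after %.0fs: %d (+%d)", iface, wait, after, after - before)
    return after > before

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print("IPIP traffic seen:", validate_ipip_traffic())
```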
[09:06:00] RECOVERY - MariaDB sustained replica lag on s8 on db1193 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1193&var-port=9104
[09:06:49] jynus: IIRC you have a long-standing phab item open about bugs in mediawiki file/metadata handling? Do you have the number to hand, please?
[09:07:22] PROBLEM - MariaDB sustained replica lag on s4 on db1244 is CRITICAL: 80.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[09:09:00] Emperor: well, not exactly what you asked, but I think it's what you mean
[09:09:46] that'll do nicely I'm sure :)
[09:10:24] RECOVERY - MariaDB sustained replica lag on s4 on db1244 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[09:10:34] I remember the existence of the ticket more accurately than its scope
[09:13:43] I cannot find it quickly, there are too many open tickets about missing or corrupted files
[09:14:41] Emperor: do you know the tag for mw media file handling?
[09:15:33] Emperor: I'm done with apus, thx <3
[09:15:45] I think it is MediaWiki-File-management
[09:15:45] jynus: https://phabricator.wikimedia.org/tag/mediawiki-file-management/ ?
[09:15:47] yeah
[09:16:21] vgutierrez: ack, thanks
[09:18:06] jynus: T289996 perhaps?
[09:18:07] T289996: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996
[09:18:25] ah, you found it
[09:18:45] yeah, this is only about known database issues, not so much about how it got corrupted
[09:22:28] I see why I couldn't find it: I wrote the text but didn't create the ticket myself
[09:22:59] yeah, likewise I had remembered it as "your" ticket
[09:23:22] https://phabricator.wikimedia.org/T289996#7324943
[09:24:02] thank you also for reminding me of T359176 ;-p
[09:24:03] T359176: Long-titled archived files can get its path metadata truncated due to not having enough storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata) - https://phabricator.wikimedia.org/T359176
[09:26:05] In theory that is fixed, but the existing data is still broken
[09:26:23] fixed only for future uploads
[09:27:22] I still have ~100K files I cannot back up; 40K are accounted for, but I haven't had the time to go through the others one by one
[09:35:29] slacker ;p
[09:35:59] how long can 60k files take to check by hand? :)
[09:37:54] Emperor: sorry to bother you again but apparently you own thanos-swift as well :D please take a look at https://gerrit.wikimedia.org/r/q/topic:%22T387293%22 when you have the chance, thx <3
[09:38:22] 😿
[09:39:53] LGTM
[09:40:54] vgutierrez: note that thanos-swift, unlike apus and ms-swift, is _one_ cluster spread across two DCs, not one cluster per DC
[09:41:34] I don't _think_ this is going to trip you up, but it's worth being aware of
[09:42:15] ack
[10:00:04] codfw done :D
[10:00:57] RX packets 97240 bytes 316652572 (301.9 MiB) --> traffic flowing on ipip0@thanos-fe2001 <3
[10:11:34] all ready, thanks again Emperor
[10:14:32] NP :)
[11:28:23] regarding the parsercache outage, it seems jobs/scripts try to update the pc entry for one page 4,000 times every second for 10+ minutes
[11:29:56] that seems ... not ideal behaviour
[11:32:22] yeah... there is a pathological path somewhere in mw
[11:32:48] Amir1: I just realized one thing, not as a possible cause but as a possible thing that could make things worse
[11:33:15] After one of the outages, we enabled the query killer on parsercaches
[11:33:30] while normally that only kills SELECTs
[11:33:46] it tries to kill everything locked if it gets close to max_connections
[11:34:01] I wonder if that could be the reason for rollbacks
[11:34:15] AND if we should disable that part of the query killer
[11:36:33] It is not ideal but I think the main objective right now is to make sure that pathological path is eliminated. No matter what we do, 5K writes/s on a single server is just too much
[11:37:24] yeah, I agree, but I wonder if that made a bad thing (overload) cause a worse thing (stalls due to writes converting to rollbacks)
[11:37:38] just wanted to reflect that, will add it to the ticket
[11:37:49] and you can later evaluate what's configured there
[11:37:58] and see if it needs further tuning
[11:40:16] That'd be great. Thank you!
[11:42:38] Done
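A simplified sketch of the query-killer behaviour described above, assuming a plain MariaDB connection via pymysql and a user with PROCESS privileges. This is not the production query killer; the thresholds and names are illustrative. It shows the two modes being discussed: normally only long-running SELECTs are targeted, but once the connection count nears max_connections everything long-running is killed, which is how in-flight writes can end up as rollbacks.

```python
# Illustrative only (not the production query killer).
import pymysql

LONG_QUERY_SECS = 60   # threshold for "long-running" (illustrative value)
HEADROOM = 0.9         # "near the limit" means >= 90% of max_connections

def kill_long_queries(conn: pymysql.connections.Connection) -> None:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        threads = int(cur.fetchone()[1])
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        max_conn = int(cur.fetchone()[1])
        near_limit = threads >= HEADROOM * max_conn

        cur.execute(
            "SELECT id, info FROM information_schema.processlist "
            "WHERE command = 'Query' AND time > %s",
            (LONG_QUERY_SECS,),
        )
        for thread_id, query in cur.fetchall():
            is_select = (query or "").lstrip().upper().startswith("SELECT")
            # Normal mode: only long SELECTs die. Near max_connections:
            # anything long-running dies, so writes can get rolled back.
            if is_select or near_limit:
                cur.execute("KILL %s", (thread_id,))

if __name__ == "__main__":
    # placeholder connection parameters
    kill_long_queries(pymysql.connect(host="127.0.0.1", user="root", password=""))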
[13:02:16] Amir1: I am going to do https://phabricator.wikimedia.org/T387433 now
[13:02:31] Thanks!
[13:02:34] I will let you know when you can take the old eqiad master, in case you have any schema change pending
[13:02:48] yeah, I think there is one for the page table
[14:20:14] Found the culprit for the PC outage: https://logstash.wikimedia.org/goto/4ff0a575239df1104df57a02da31d4a2
[14:20:39] the article DAVID_EVRARD on French Wikipedia
[15:21:44] Amir1: db1201, the old s6 master, is ready for you
[15:27:39] marostegui: thanks. Currently neck-deep in the pc issue, will get to it once I'm done
[15:27:52] Amir1: no rush at all, it was just a heads-up
[15:27:59] Thanks <3
[17:26:28] just a more specific heads-up, given that the cookbooks downtime DBs etc.: we're starting the switchover live test
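A generic illustration of how the pathological path discussed earlier (thousands of identical parser cache writes per second for one page) could be damped: skip rewriting an entry that was refreshed within the last few seconds, so redundant updates collapse into roughly one per window. This is not MediaWiki code; the names, window size, and in-process store are all made up for the sketch.

```python
# Illustrative only (not MediaWiki): collapse repeated cache writes for
# the same page into at most one per REFRESH_WINDOW_SECS.
import time
from typing import Callable

REFRESH_WINDOW_SECS = 5.0
_last_written: dict[str, float] = {}

def maybe_write_parser_cache(page_key: str, render_and_store: Callable[[], None]) -> bool:
    """Run render_and_store() only if the entry is stale; return True if it ran."""
    now = time.monotonic()
    if now - _last_written.get(page_key, 0.0) < REFRESH_WINDOW_SECS:
        return False  # a fresh entry already exists; skip the redundant write
    render_and_store()
    _last_written[page_key] = now
    return True
```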