[09:55:29] Amir1: Can I depool ms1?
[09:55:38] sure!
[09:58:45] when you have time, we should start removing x2 from dbctl
[09:58:50] Amir1: done
[09:58:57] <3
[09:58:59] Amir1: Yeah I will do it
[09:59:06] Amir1: everything looks good now, right?
[09:59:25] let me check
[09:59:31] (logs, etc.)
[10:00:01] so far nothing jumped out
[10:00:13] ok
[10:00:22] I am going to go ahead and reboot
[10:01:07] awesome <3
[10:02:02] for the next batch of reboots, I think I can build an automation for msX and pcX
[10:02:34] This is finally done https://phabricator.wikimedia.org/T376905 !
[10:03:45] xD
[10:03:50] \o/
[10:04:28] I think this is the first time we have completed a reboot *before* the new set starts
[10:04:40] No, we've done it before
[10:04:50] damn it
[10:10:40] Amir1: ms1 is in orchestrator now, x2 is gone: https://orchestrator.wikimedia.org/web/cluster/alias/ms1
[10:10:57] \o/
[10:11:05] ms1 repooled
[10:11:09] I updated the docs yesterday, let me know if I missed anything
[10:11:20] wohoo
[10:17:20] Amir1: I am removing the hosts from x2
[10:17:29] 🍿
[10:18:11] Amir1: Done, please check
[10:18:44] It's gone from eqiad.json
[10:18:46] let me check logs
[10:19:13] looks good
[10:19:22] ok, removing from conftool in puppet
[10:19:24] I set mw to ignore x2
[10:19:26] I will do valid_sections later
[10:19:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130970
[10:19:55] https://github.com/wikimedia/operations-mediawiki-config/blob/master/src/etcd.php#L121
[10:20:08] good!
[10:23:51] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130970 looks good?
[10:24:44] I loathe that regex, we have a list of valid sections somewhere else and it's not searchable
[10:24:56] but of course, that's not the fault of this patch
[10:25:07] I know where the valid section list is
[10:25:13] But I want to remove it from dbctl first
[10:25:16] and then from puppet itself
[10:25:29] yeah, I mean when that exists, this regex is redundant and annoying
[10:25:41] so many times we forgot to add or remove a section in the regex
[10:26:25] (especially since it doesn't show up in search)
[10:26:57] my rant is more about a long-term change I want to see :D
[10:27:08] Amir1: now time for this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130972
[10:28:28] done
[10:28:37] the binlog of ms1 is still going down: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=1h&var-server=db2142&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&viewPanel=28
[10:29:05] $ dbctl -s eqiad section x2 get
[10:29:05] Execution FAILED
[10:29:05] Reported errors:
[10:29:05] DB section 'x2' not found
[10:29:07] good!
[10:29:56] can we add a note to curse x2 instead?
[10:30:47] https://usercontent.irccloud-cdn.com/file/HmKtRcoV/grafik.png
[10:30:48] :)
[10:31:00] I should do something useful
[10:32:02] Amir1: you're behind the zeitgeist, you have to misspell something in that meme now
[10:32:10] (useful, you say?)
[10:32:14] :D
[10:32:55] I should invite the reporter from The Atlantic to our root group?
[10:33:40] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130974 this should be all
[10:34:25] done
[10:39:53] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130976 first steps towards x3
[10:41:19] one thing
[10:41:55] fixed
[10:43:10] Thanks!
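
The depool → reboot → repool cycle and the x2 removal above map roughly onto a handful of dbctl invocations. A minimal sketch, assuming the usual dbctl instance/config subcommands; the host name (db2142, taken from the grafana link above) and the commit messages are illustrative, so the exact commands run here may have differed:

$ dbctl instance db2142 depool                          # take the replica out of rotation before the reboot
$ dbctl config commit -m "Depool db2142 for reboot"
  ... reboot the host, wait for MariaDB to come back and replication to catch up ...
$ dbctl instance db2142 pool                            # put it back afterwards
$ dbctl config commit -m "Repool db2142 after reboot"
$ dbctl -s eqiad section x2 get                         # once the section is removed this fails with "DB section 'x2' not found", as seen at 10:29
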
[11:15:58] PROBLEM - MariaDB sustained replica lag on s7 on db2220 is CRITICAL: 43.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2220&var-port=9104
[11:16:32] aha, I'll add waiting for lag to drop before repooling then
[11:17:58] RECOVERY - MariaDB sustained replica lag on s7 on db2220 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2220&var-port=9104
[11:44:48] FIRING: [8x] MysqlReplicationLagPtHeartbeat: MySQL instance db2150:9104 has too large replication lag (51m 25s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[11:45:55] there are a lot of pages for s7 dbs with replication lag
[11:47:08] I am in -operations
[11:47:15] yep, me too
[11:47:21] just wanted to ping here for awareness
[11:54:48] RESOLVED: [8x] MysqlReplicationLagPtHeartbeat: MySQL instance db2150:9104 has too large replication lag (59m 20s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[14:49:17] hello, do any of you know if there's a way to see how long mediawiki waits before timing out a connection to mysql? I know wgMaxUserDBWriteDuration and wgMaxExecutionTimeForExpensiveQueries exist, but we're seeing cascading outages at Miraheze when one DB has performance issues and php-fpm workers get exhausted, which I'd like to limit
[23:39:07] PROBLEM - MariaDB sustained replica lag on s4 on db1247 is CRITICAL: 40 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
[23:40:07] RECOVERY - MariaDB sustained replica lag on s4 on db1247 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
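
The "waiting for lag to drop before repooling" step mentioned at 11:16 can be done with a simple poll of the replica before the dbctl repool. A minimal sketch, assuming mysql client access to the replica and using SHOW SLAVE STATUS rather than the pt-heartbeat table the alerts above are based on; the host, threshold and sleep interval are illustrative only:

host=db2220        # illustrative: the s7 replica from the 11:15 alert
threshold=10       # seconds, matching the CRITICAL threshold in that alert
while true; do
    lag=$(mysql -h "$host" -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
    # break out once the replica reports a numeric lag below the threshold
    if [ -n "$lag" ] && [ "$lag" != "NULL" ] && [ "$lag" -lt "$threshold" ]; then
        break
    fi
    sleep 30
done
dbctl instance "$host" pool
dbctl config commit -m "Repool $host after replication lag recovered"

A production version of this would also want an overall timeout and would probably read the same pt-heartbeat data the alerts use, so the repool decision matches what actually paged.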