[10:41:44] I deployed the silenced hosts table at https://zarcillo.wikimedia.org/ui/hosts as a prototype/mvp. Meanwhile I got the usual air conditioning flu/cold and I'm going through my day a bit slowly
[10:42:12] federico3: thanks - take care!
[11:04:11] I've created this task: https://phabricator.wikimedia.org/T429317 would you like me to tag it in some specific way to get it into the queue?
[11:05:49] slyngs: that's good enough, we will take it from there thanks
[11:06:17] Thank you :-)
[11:44:55] folks heads up I'm going to start depooling the db nodes in codfw rack A5 in the next few mins for the switch upgrade T428020
[11:44:56] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020
[11:45:26] topranks: thanks
[11:45:27] federico3: ^
[11:45:58] yep, as planned, thanks for the heads up
[12:25:57] topranks: do you have a rough idea of completion time?
[12:27:52] cezmunsta: the window is for 30 minutes
[12:28:02] ack ty
[12:28:03] but usually it happens faster
[12:28:30] Just spotted the email
[12:29:06] cezmunsta: you should keep an eye on T426197
[12:29:07] T426197: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197
[12:33:25] ty
[12:33:57] cezmunsta: just waiting for the switch to complete its upgrade now, it's taking a little longer than expected
[12:34:20] probably another 10 mins, I'll post on task and in #wikimedia-sre when done
[12:34:30] +1
[12:46:58] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 28 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[12:47:58] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[12:52:47] cezmunsta: thanks for your patience, we are all done now if you want to go ahead
[12:53:03] things look good in general I am repooling the db hosts
[12:54:04] topranks: thanks
[14:53:34] federico3: I presume that it is expected that test-s4 has broken replication?
[15:06:54] cezmunsta: I've always said that I'd like to see it working most of the time unless it's being worked on for something. But it's been broken for a month now and I've been too busy to keep chasing this
[15:09:46] marostegui: is there a defined process in terms of managing that cluster, it looks pretty small on disk at least
[15:11:43] cezmunsta: there's not. But I've always thought that if something needs to be tested it's fine to break it, but once done, please leave it running for the next person tests. If I need to test something and I find it broken, now that person needs to investigate, fix and then test. Which is not very efficient
[15:12:09] All data there is volatile so we can simply stop replication and reconfigure it from the current master position
[15:22:54] cezmunsta: I'm aware, it broke during a switchover test and does not want to start again, as soon as I find some time I'll try to reset the whole replication chain to see if it restarts
[15:27:30] Please could you be more specific there? db-test100[1,2] and db-test2001 are OK. At least 2 of the broken ones are looking for GTIDs that don't exist
[15:28:03] sorry, db-test100[2,3] are the OK replicas
[15:30:19] Any reason not to use the clone cookbook to fix the broken ones?
[15:36:22] i don't remember if we ever tried the clone cookbook on them as the test-* hosts are VMs
[15:36:52] @marostegui are these new hosts being set up? If so is there a task I can add to the reboot task?
[15:36:55] https://www.irccloud.com/pastebin/06eoMETD/
[15:46:15] federico3: I have not touched test cluster in a long time, I think you set up all those hosts
[15:46:52] @marostegui I was asking about a different set of hosts (see the paste above)
[15:48:11] federico3: db1300 is amir's, the others I'll check tomorrow, I'm out already and on my phone
[15:53:08] federico3: but you can also check their puppet roles and see where they belong
[15:53:36] clone -> AssertionError: db1176.eqiad.wmnet is not master according to zarcillo
[15:54:18] federico3: does that mean that Zarcillo data is inaccurate?
[15:54:57] cezmunsta: the testbed hosts are special, being a testbed
[15:55:26] I am not sure if that answers the question. The UI shows db1176 (master)
[15:55:53] especially db1224
[15:57:02] @marostegui i checked and they are not in puppet, so I asked :)
[15:57:12] (anyhow nothing urgent)
[15:57:20] federico3: btw I found a typo in a function name: check_if_already_in_zacillo
[15:57:33] also ensure_db_not_in_zacillo
[15:58:47] want to send a MR?
[15:59:09] Yep, will do ... just noting it here whilst reading the code :)
[16:00:38] Looking for the issue as I only get one row for this query: select instance from masters where section = "test-s4";
[16:01:55] Ah, it is looking for instances not sections
[16:07:02] The help shows hostname/FQDN yet in the table it is not an FQDN, is that expected?
[16:16:46] sorry, where exactly?
[16:22:40] sudo cookbook sre.mysql.clone --help
[16:23:59] It finds 3 rows for db1176, zero for db1176.eqiad.wmnet
[16:25:48] yes it looks for instances (because a host can have 0, 1 or more instances) and check_if_already_in_zacillo should be moved to use the API instead
[16:34:31] Yes, but it is looking for an FQDN that the code produced and the records in the masters table are for the hostname
[16:35:03] I will update the row for test-s4 in that table and see if it changes the outcome of the cookbook
[16:40:53] It did not
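
Editor's note: the hostname/FQDN mismatch described between 16:07 and 16:34 can be sketched as below. This is only an illustration, not the cookbook's or Zarcillo's actual code. The in-memory SQLite table borrows the `masters`, `instance` and `section` names from the query quoted in the log, but everything else about the schema is an assumption, and the helpers `short_hostname`, `is_master` and `build_demo_db` are hypothetical. Note that the log itself ends with the row update not changing the cookbook's outcome, so this mismatch may not be the real cause of the AssertionError.

```python
import sqlite3


def short_hostname(name: str) -> str:
    """Return the part before the first dot (db1176.eqiad.wmnet -> db1176)."""
    return name.split(".", 1)[0]


def is_master(conn: sqlite3.Connection, host: str, section: str) -> bool:
    """Check a hypothetical masters table, comparing short hostnames.

    Both sides are normalised, so the lookup works whether the table
    stores 'db1176' or 'db1176.eqiad.wmnet'.
    """
    rows = conn.execute(
        "SELECT instance FROM masters WHERE section = ?", (section,)
    ).fetchall()
    return any(short_hostname(row[0]) == short_hostname(host) for row in rows)


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the masters table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE masters (instance TEXT, section TEXT)")
    # The table holds the short hostname, as reported in the log.
    conn.execute("INSERT INTO masters VALUES ('db1176', 'test-s4')")
    return conn


conn = build_demo_db()

# An exact-match lookup by FQDN finds nothing, mirroring
# "zero for db1176.eqiad.wmnet" in the log.
exact = conn.execute(
    "SELECT COUNT(*) FROM masters WHERE instance = ?", ("db1176.eqiad.wmnet",)
).fetchone()[0]
print(exact)

# The normalised comparison matches regardless of which form is passed in.
print(is_master(conn, "db1176.eqiad.wmnet", "test-s4"))
```

Normalising both sides keeps the check tolerant of either storage form; the 16:25 message suggests the longer-term direction is to move such checks to the API instead.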