[06:00:34] looks like s5 went alright
[07:27:28] good morning, I see that the switchdc prepare cookbook was run tonight and AFAICT it went ok aside from T375144. Do you have any feedback or things you would like modified?
[07:27:29] T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144
[09:00:22] volans: I have 2 immediate requests that should be easy
[09:00:46] sure, either fire away or open a task, whatever is easier
[09:01:09] can the script/is it reasonable to log per section on phabricator?
[09:01:21] typically we do 1 section, then we check
[09:01:26] sure, I thought you might ask for it :)
[09:01:33] and it would be useful to see it on phab too
[09:01:40] the other is the order
[09:01:47] I think it does s1 s2 s3
[09:01:47] inverted?
[09:02:10] not sure if inverted, as es or s8 are, I would say, second tier
[09:02:10] it does them in this order
[09:02:11] https://doc.wikimedia.org/spicerack/master/api/spicerack.mysql_legacy.html#spicerack.mysql_legacy.CORE_SECTIONS
[09:02:39] does it keep the configured order?
[09:02:56] I guess so
[09:03:01] yes, it goes through the list, so if I invert it it would go es7, es6, ...
[09:04:01] I would do something like s6 s5 s2 s7 s3 s8 s4 s1
[09:04:51] let me get the list from the order in which mysql upgrades happen
[09:05:07] the reasoning is: "how hard is it to recover from backups if something went wrong"
[09:05:14] makes sense
[09:05:19] and with s1 being between 1/3 and 1/2 of the visits
[09:05:24] it is very impactful
[09:05:25] if we think it is the best way we can change the constant directly in spicerack
[09:05:40] so anything that loops through sections will get them in the safest order
[09:05:55] that's ok, if you can do a sample patch to where it is
[09:06:04] and then we can agree on the right order
[09:06:43] with the dbas
[09:06:51] those are more usability patches
[09:06:58] but I think they will want it
[09:07:18] the important one is to mitigate the ROW replication issues
[09:07:35] T375144
[09:07:36] T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144
[09:08:21] (note the other also applies to the switchover, if something breaks, we prefer it breaking on s6 first, as it is trivial to fix)
[09:08:28] *order
[09:10:30] jynus: starting point for discussion https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1074111
[09:10:47] volans: thanks
[09:11:19] that's exactly what I wanted, the order to be discussed
[09:11:56] we may actually in the future use something else to read those
[09:12:09] as we have a valid_sections global puppet config
[09:13:03] and I am sure the logging will take some more time, but it shouldn't be too hard (?)
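A rough sketch of the ordering change discussed above — assuming CORE_SECTIONS is a plain tuple in spicerack.mysql_legacy and using the s6 s5 s2 s7 s3 s8 s4 s1 order proposed here; the final order is still to be agreed with the DBAs, so this is illustrative only:

```python
# Hypothetical sketch, not the actual spicerack patch: core sections ordered by
# "how hard is it to recover from backups if something went wrong", with s6
# (trivial to fix) first and s1 (between 1/3 and 1/2 of the visits) last.
CORE_SECTIONS = ("s6", "s5", "s2", "s7", "s3", "s8", "s4", "s1")


def sections_to_process() -> tuple[str, ...]:
    """Return the core sections in the safest processing order.

    Anything that loops through sections (switchdc prepare, switchover, ...)
    would then pick them up in this order automatically.
    """
    return CORE_SECTIONS


if __name__ == "__main__":
    for section in sections_to_process():
        print(f"would process {section}")
```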
[09:13:14] nah, a few minutes
[09:13:23] the ROW one will be more complex because I am not sure how to fix it logically
[09:13:51] as what failed there wasn't the script logic, as I understood it
[09:13:54] but our practices
[09:15:23] the issue comes from ROW not having a replace command
[09:15:39] so it translates it to either an insert or an update of the row
[09:16:01] yeah
[09:16:10] most likely update
[09:16:26] but the row doesn't exist on the other replica set
[09:16:50] so either we keep it clean, and either delete it from the original replica set or insert a fake one on the primary
[09:17:13] or we keep the ROW ones uncleaned all the time
[09:18:17] the logic is not too crazy: "if the secondary master uses ROW and the primary master doesn't have the row, insert it", but it doesn't feel right
[09:19:37] arnaudb: I have a few asks to check before the switchover too, regarding checking that codfw is prepared
[09:20:56] volans: I don't know if you have any immediate thoughts about the ROW stuff, I would like to know if I am missing something obvious
[09:21:06] sure
[09:25:09] jynus: do you have handy the query that broke the replication? I want to make sure I fully understand the problem
[09:25:14] volans: there is one last thing
[09:25:29] I need to rewrite it in rust? :D
[09:25:36] I think the automation should have caught the issue
[09:25:46] early
[09:26:13] can you see if the automation checks that the primary master (eqiad for us) was replicating yes/yes ?
[09:26:19] maybe we missed that check
[09:26:24] or it happened too early
[09:26:36] volans: we're shifting to cobol
[09:26:38] but ideally it should have caught that there was an error
[09:27:10] volans: I am filling in T375144
[09:27:11] T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144
[09:27:12] what do you mean by "replication yes/yes" ?
[09:27:29] replication sql running: Yes, replication io running: Yes
[09:27:39] the wording may be slightly different
[09:27:42] in show slave status
[09:27:48] this before starting?
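A minimal sketch of the "if the secondary master uses ROW and the primary master doesn't have the row, insert it" idea discussed above. The pymysql-style connections, the heartbeat.heartbeat table name, its ts/server_id columns and the function name are assumptions, not the actual cookbook code:

```python
# Hypothetical sketch of the mitigation discussed above, not the real cookbook
# logic. Assumes pymysql-style connections and a heartbeat.heartbeat table
# keyed by server_id; adjust to the real schema.


def ensure_heartbeat_row(primary_conn, secondary_conn, secondary_server_id: int) -> None:
    """Make sure the primary has the heartbeat row the secondary will UPDATE.

    With ROW-based replication, pt-heartbeat's REPLACE is binlogged as a row
    event (typically an UPDATE of the existing row); if that row was cleaned
    up on the other replica set, the replicated UPDATE finds nothing to change
    and replication breaks.
    """
    with secondary_conn.cursor() as cursor:
        cursor.execute("SELECT @@binlog_format")
        (binlog_format,) = cursor.fetchone()

    if binlog_format != "ROW":
        return  # STATEMENT/MIXED re-execute the REPLACE, nothing to do

    with primary_conn.cursor() as cursor:
        cursor.execute(
            "SELECT 1 FROM heartbeat.heartbeat WHERE server_id = %s",
            (secondary_server_id,),
        )
        if cursor.fetchone() is None:
            # Insert a placeholder row so the replicated UPDATE has a target.
            cursor.execute(
                "INSERT INTO heartbeat.heartbeat (ts, server_id) VALUES (NOW(6), %s)",
                (secondary_server_id,),
            )
    primary_conn.commit()
```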
[09:27:53] nope, after setup
[09:28:07] I wonder if the 1 second wait was enough
[09:28:13] as heartbeat happens every second
[09:28:22] so more than 1 second may have passed since the start
[09:28:31] or we just didn't check that
[09:29:05] we could check that N times with a small sleep
[09:29:32] please check how the logic is now
[09:29:33] after slave start we check that the values are the ones in this dict:
[09:29:36] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/databases/prepare.py#242
[09:29:44] see the 'expected' variable
[09:29:56] but that's for master_to
[09:30:00] yeah, so it likely failed after 1+ second
[09:30:06] so it went under the radar
[09:30:10] no, we check only the target
[09:30:14] so codfw
[09:30:15] or
[09:30:19] it happened much later
[09:30:25] after the start of heartbeat
[09:30:31] so we may just need a later check
[09:30:33] at the end
[09:30:46] after pt-heartbeat-wikimedia starts at the end
[09:30:47] no no, it checks master_to only, not master_from, you're right
[09:30:54] so I'll just add a check for that too
[09:30:56] I see
[09:31:06] so one last check at the end should be ok
[09:31:12] adding it
[09:31:27] as we worried a lot about the secondary
[09:31:35] because the primary didn't break much
[09:31:41] but it would have alerted us earlier
[09:31:47] so that is for detection
[09:32:19] yeah, sorry about that, I don't think it was explicitly mentioned in the doc, but it makes total sense to check that too
[09:32:41] that's ok, if we had known it was going to fail, I would have said so!
[09:32:53] this was a surprise for everybody!
[09:33:54] just make sure there is 1+ second between pt-heartbeat and the last check
[09:34:06] k
[09:34:15] which I think happens anyway because of the start slave on the secondary
[09:34:45] your check was there
[09:35:00] but it failed on the following step, so we need to check again
[09:43:17] volans: I have the timing - the change master happened at 2024-09-18 21:27:14
[09:43:33] and the error at 2024-09-18 21:27:17
[09:44:10] so the original check wouldn't catch it
[09:44:30] should we add a sleep before the last checks?
[09:44:38] or repeat them N times
[09:44:58] nah, we just need a last check after the last start
[09:45:10] ok
[09:45:24] as it depends on when the others do their things
[09:55:22] I filled in T375144 with all the information
[09:55:23] T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144
[09:55:32] if you want to reference it for patches
[09:55:37] <3
[09:56:11] arnaudb: quick 5 1:1 to get you up to date and then we work individually?
[09:56:16] 5min
[09:56:39] sure, I'll send you a url!
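A minimal sketch of the extra detection step agreed on above: one final check on both masters after pt-heartbeat-wikimedia has started, leaving more than one heartbeat interval for a row event to replicate. The connection objects, helper names and the 2-second sleep are assumptions; this is not the actual prepare.py code, which keeps its own 'expected' dict:

```python
import time

# Hypothetical sketch of the last check discussed above, not the real
# sre.switchdc.databases.prepare cookbook code.

EXPECTED = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"}


def check_replication(connection, label: str) -> None:
    """Fail loudly if replication is not running cleanly on this master."""
    with connection.cursor() as cursor:
        cursor.execute("SHOW SLAVE STATUS")
        row = cursor.fetchone()
        if row is None:
            raise RuntimeError(f"{label}: no replication configured")
        columns = [desc[0] for desc in cursor.description]
        status = dict(zip(columns, row))
    for key, expected in EXPECTED.items():
        if status.get(key) != expected:
            raise RuntimeError(f"{label}: {key} is {status.get(key)!r}, expected {expected!r}")


def final_check(master_from_conn, master_to_conn) -> None:
    """Last check of the run, after pt-heartbeat-wikimedia has been started.

    Waiting a bit more than the 1s heartbeat interval ensures at least one
    heartbeat event has replicated in both directions before declaring
    success, which is what would have surfaced T375144 at run time.
    """
    time.sleep(2)
    check_replication(master_to_conn, "master_to (codfw)")
    check_replication(master_from_conn, "master_from (eqiad)")
```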
[09:57:45] url sent :) hop in whenever
[10:05:34] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:20:54] That's still T372207; I'll extend the silence until after my vacation; we were expecting replacement hardware this Q, but it seems to have got delayed rather
[10:20:55] T372207: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207
[10:25:30] patches sent, I've added both of you
[10:26:03] and for the sections order I also added am.ir and ma.nuel so anyone can chime in
[12:00:57] volans: please be patient, as testing may take some time and we have more important tasks atm blocking the switchover + incident handling
[12:01:03] thanks for the work, volans
[12:01:17] it is very much appreciated that you are so open to improvements
[12:04:23] sure, no worries, just wanted to get them out before I forget the details :)
[13:02:21] I have a patch to add a new fileset to bacula, if anyone would like to review it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074156
[13:03:58] Overall size in bacula shouldn't change much, as we will be migrating postgresql databases from an-db1001 to this new system.
[13:10:12] thanks for checking, normally I just +1, but it is nice to be aware for capacity planning
[13:11:46] in this case I actually have some comments that may improve stuff
[13:12:00] Great! All input welcome.
[13:16:19] you have my +1, but I commented something to try to set up for the future
[13:17:28] but you won't see me blocking adding more backups :-D
[13:36:35] Cheers.
[14:22:56] at your will, feel free to resolve T371351 and the improvement can go to the post-incident ticket or elsewhere.
[14:22:57] T371351: Automate the pre/post switchover tasks related to databases - https://phabricator.wikimedia.org/T371351
[15:02:19] ^that was for volans
[15:15:28] sure thx
[15:18:40] btullis: db1208 is warning that there are no backups, is it ready for an initial backup run already?
[16:20:54] jynus: yes, but there will probably just be a single zero-length file.
[16:22:01] ok, then I will let it run on its own tonight
[16:22:31] (just FYI, it may alert if it gets 0-byte backups)
[16:26:24] OK, I may be able to get some content there before tonight's run. Thanks.
[16:41:35] have a nice day, see you on monday
[16:43:06] o/
[16:45:39] Empero.r: are you aware of ms-be1058? I went to look at the puppet failure (as seen in my inbox), and the root cause seems to be a disk failure (sdc). But the disk failure seems to be from Aug 10, and the puppet failure is recent.
[16:45:47] I didn't find an open ticket.
[17:12:31] urandom: see scroll, where I silenced the alert earlier. The DC team closed out the ticket for the dead disk (I think they have objectives around closing tickets); there are no spares and the system is due to be retired (it was going to be this Q, but that looks less likely now).
[18:31:58] ah, the silence expired...
[18:32:44] I was most concerned with why we'd only just get the notification