[07:20:13] There is an uncommitted DNS change in Netbox that is related to the mgmt interface of db1179. arnaudb is this related to some WIP?
[07:20:33] checking
[07:21:11] the server was handled at the end of July https://sal.toolforge.org/production?p=0&q=db1179&d=
[07:21:56] it's an x1 server which is pooled, I guess the DNS thing is an overlook?
[07:26:10] what do you mean by overlook?
[07:26:23] the diff is:
[07:26:26] -wmf4972 1H IN A 10.65.0.216
[07:26:29] +wmf5164 1H IN A 10.65.0.216
[07:26:48] oversight***
[07:27:00] french brain not awake yet
[07:28:10] doesn't look like anything to me™
[07:31:40] ah, the asset tag was fixed yesterday, I'll merge the changes to unblock others
[07:31:43] https://netbox.wikimedia.org/extras/changelog/191050/
[08:30:20] volans: so the delay is too much - I timed it and it was 58 seconds under ideal circumstances
[08:33:58] in prepare, between stop replication and start replication?
[08:34:15] yes, there are a few steps in between
[08:35:08] how much is that when doing it manually? :)
[08:35:12] one second, I'll run the script again in the opposite direction to validate the latest changes
[08:35:44] and then I'll suggest a better approach, closer to what manuel probably does
[08:37:47] I can tail the logs and see which steps are taking more time
[08:38:41] it will be removing the 10 second pause (because we already pause later) and doing the stop heartbeat earlier
[08:40:17] makes sense
[08:40:43] you can follow the reversion here: https://orchestrator.wikimedia.org/web/cluster/alias/test-s4
[08:41:18] and tail -f prepare.log :D
[08:41:30] I already checked the fixes work well the other way
[08:41:40] great
[08:43:12] [test-s4] MASTER_FROM db2230.codfw.wmnet @@read_only=0
[08:43:35] oh, I executed the wrong script
[08:43:40] my fault
[08:43:43] which is good
[08:44:00] but can you see why the naming is confusing to me?
[08:45:32] sure
[08:45:38] stop slave 08:44:58,993
[08:45:56] start slave 08:45:47,453
[08:45:57] 51 seconds or so
[08:47:43] the thing is that the checker already waits, which is the point of those 10 seconds
[08:48:19] jynus: the reason is just that I'm stupid :D
[08:48:26] ?
[08:48:29] why?
[08:48:45] the logic is fine, but it is too literal
[08:48:57] so we do the whole 11 attempts to check if the master_status is stable
[08:49:04] and then count if stable > 5
[08:49:10] and each one has a sleep of 3s
[08:49:11] ah, so there is a logic problem
[08:49:29] I was only checking that the value was correct
[08:49:33] so we do 10*3s sleeps
[08:49:41] anyway, even if nothing is moving, that is stupid
[08:49:53] check wait_master_to_position, there are two TBDs
[08:49:58] let me finalize the return to normal and we'll see how we iterate
[08:50:01] sure
[08:50:20] that way I validate the last patch - we can even merge it because it is not harmful
[08:50:33] and then optimize the wait
[08:51:38] one sec, I need to do a "manual switchover" (change the read_only values)
[08:54:34] ok, the rest looks ok
[08:57:17] that's my +1 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1059052/comments/009c30e4_c2bbc1f5
[08:58:42] sorry for the delay, got a phone call
[08:58:45] arnaudb: we should schedule some time to do some final tests and run for real with you and Amir1
[08:58:52] sure :)
[08:59:36] db maintenance finishes today, right?
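To make the "11 attempts, then count if stable > 5" behaviour discussed above easier to follow, here is a minimal sketch of that fixed-iteration style of check. This is not the cookbook's actual code; `get_master_status` and the parameter names are illustrative assumptions. It shows why the replication-stopped window pays roughly 10 * 3s of sleeps even when the binlog position never moves:

```python
import time

def master_status_is_stable_fixed_iterations(get_master_status, attempts=11, sleep=3, min_stable=6):
    """Fixed-iteration stability check, as described in the chat: always run
    all attempts, sleep between them, and only count at the end whether enough
    consecutive readings matched. Even when the binlog position never changes,
    this spends (attempts - 1) * sleep seconds (~30s here) with replication stopped."""
    previous = None
    stable = 0
    for i in range(attempts):
        current = get_master_status()  # e.g. (binlog file, position) from SHOW MASTER STATUS
        if previous is not None and current == previous:
            stable += 1
        previous = current
        if i < attempts - 1:
            time.sleep(sleep)          # 10 sleeps of 3s, regardless of how quiet the master is
    return stable >= min_stable
```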
[09:00:05] yep, we have a slot after the 16:00 UTC maintenance and before Amir1 enables circular repl
[09:00:21] (I have to run a switchover in that timeslot as well)
[09:00:30] jynus: great, then I'll merge and send a patch for the delays
[09:00:50] ok to merge, but not to run it!
[09:01:18] sure sure :D
[09:42:43] jynus: just to clarify, should I completely remove the 10s hard sleep or reduce it?
[09:43:17] that should be moved to the other check
[09:43:45] I think if we do the stopping of pt-heartbeat before
[09:44:00] and do a 5 second check we should be ok
[09:44:26] the other check - I'm making it exit when 3 master status checks with 3s sleeps in between are identical, so basically 9s of sleeps plus 3 MASTER STATUS queries
[09:44:36] yeah, that would work
[09:44:51] so in this case we can remove the 10s hard sleep
[09:44:53] I think those are the 10 second wait
[09:44:56] ok
[09:45:12] and I'm moving the stop heartbeat right before the stop slave
[09:45:37] thanks, send the review and we can test again extensively
[09:46:09] should I also make the s/cross/circular/ rename?
[09:46:51] up to you, I would that changed, but that is not a blocker for today
[09:46:56] *like to see
[10:01:00] I'd like someone here to cross-check this wikireplica view https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073430
[10:02:43] jynus: here it is: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1073750
[10:12:36] I was checking it, logic seems sane, let me test it now
[10:13:33] jynus: I just sent a new PS based on arnaud's comment
[10:13:46] I changed some logging from debug to info as I gather it might be useful to see it when running it
[10:14:23] yeah, I had a similar comment to that before
[10:14:24] I will not touch it anymore until your feedback :)
[10:14:38] when waiting more than 1 second, the operator gets sweaty
[10:14:51] so he was on point in asking for that
[10:15:21] sure sure, we don't want your heart rate to rise for this
[10:15:42] it's supposed to help reduce the stress of those operations ;)
[10:16:36] ofc you need yesterday's local modification for test-s4 (repeating it just for paranoia)
[10:17:52] yep, just reading slowly
[10:27:01] the dry run waits 3 seconds only, as expected
[10:27:10] will now do a real run on test-s4 only
[10:28:50] ack
[10:30:41] silly question: could having moved the pt-heartbeat stop right before the stop slave cause any alert to fire?
[10:31:14] not really
[10:32:10] stop slave will, as alerts under normal circumstances will depend on the primary id
[10:32:41] do we need to downtime additional stuff?
[10:32:57] i'd say no if things work normally
[10:33:09] (within those 10 seconds)
[10:33:19] and we may want an alert if they get outside of that
[10:33:33] a later refinement may restart replication automatically
[10:33:42] if it doesn't catch up within retries
[10:33:57] but we can leave the current logic like this for now (let the human take over)
[10:34:26] I think we don't want to hide too many bad statuses
[10:34:34] indeed, I agree
[10:34:38] how long was the wait this time?
[10:34:52] I was checking just now, it seemed less but I want to get the data
[10:35:33] looks like 10:30:21,670 - 10:30:40,177 to me
[10:35:35] 2024-09-18 10:30:26,698 jynus 1776228 [INFO] MASTER_TO db2230.codfw.wmnet STOP SLAVE.
[10:35:38] if I got it right from the logs
[10:36:34] 2024-09-18 10:30:40,177 jynus 1776228 [INFO] MASTER_TO db2230.codfw.wmnet START SLAVE.
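For contrast with the earlier sketch, here is roughly what the revised waiting logic described above amounts to. Again this is a sketch, not the cookbook's real code: `get_master_status`, the `master` object and its method names are assumptions. The idea is to stop pt-heartbeat right before STOP SLAVE, then exit the stability check as soon as three consecutive SHOW MASTER STATUS readings match, which in the quiet case costs about 9 seconds of sleeps plus three queries:

```python
import time

def wait_until_master_status_stable(get_master_status, needed_identical=3, sleep=3, max_attempts=11):
    """Early-exit stability check: return True as soon as `needed_identical`
    consecutive readings of the binlog coordinates are the same, instead of
    always burning through every attempt."""
    previous = None
    identical = 0
    for _ in range(max_attempts):
        time.sleep(sleep)
        current = get_master_status()  # e.g. (binlog file, position) from SHOW MASTER STATUS
        identical = identical + 1 if current == previous else 1
        previous = current
        if identical >= needed_identical:
            return True  # quiet case: ~needed_identical * sleep seconds of waiting
    return False

def critical_section(master, get_master_status):
    """Ordering discussed above; `master` and its methods are hypothetical helpers."""
    master.stop_pt_heartbeat()   # moved right before STOP SLAVE
    master.stop_replication()    # STOP SLAVE
    if not wait_until_master_status_stable(get_master_status):
        raise RuntimeError('binlog position did not stabilise; leaving it to the operator')
    # ... swap replication direction / flip read_only here ...
    master.start_replication()   # START SLAVE
```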
[10:37:04] sorry, I got the previous one I think
[10:37:44] so ~15s
[10:37:57] yeah, I think that is within parameters
[10:38:30] but much less than the 50 before
[10:40:47] I am going to now run the finalize in the opposite direction, which is how one would revert the action (I need to edit the file for testing)
[10:42:53] looking good
[10:43:13] \o/
[10:43:53] but again the naming - to revert the change, I need to run finalize in the opposite direction - not trivial at all
[10:44:39] in my head it's "let's finalize the switch from A->B" but if that's not obvious for you we can change it as you prefer.
[10:44:46] nope
[10:45:05] in this case, to revert, I was running "let's finalize the switch from B->A"
[10:45:11] but we didn't even start it!
[10:45:22] I agree that having "revert prepare A->B" equal to "finalize switch from B->A" is counterintuitive
[10:45:35] * volans open to suggestions
[10:46:03] I don't like that, and we should change it to setup/stop-circular-replication
[10:46:11] but let's not move things anymore
[10:46:31] functionality-wise I think it is ok now
[10:47:02] that's the critical part
[10:47:31] as long as you understand that we are compromising in some aspects to make it ready for the time constraints
[10:47:40] I think it is ok
[10:48:18] I would like to get rid of mysql_legacy and implement a pure mysql interface, there is too much overhead with remote execution
[10:48:42] but certainly not on this iteration :-D
[10:49:16] yeah we can fix naming and usability after the switchover
[10:49:54] volans: I say merge, and then we can show the current status to others so they get confident and raise further concerns
[10:50:02] SGTM
[10:51:00] they should see it running ok and failing so they are ready to act when a problem arises
[10:51:43] as for mysql_legacy, it has the advantage of having a cheap way to do parallelization that might be useful to do simple things over a large number of hosts, and an easy way to also perform shell commands on hosts (managing systemctl units and such). For anything more complex on the mysql side I totally agree that the mysql module (with a mysql client) should be used (and improved)
[10:52:45] the current code can easily be run in DRY-RUN mode against prod for prepare. Finalize is slightly trickier, the full run could only be run after the master switch; right now it can be run in inverted mode but it will bail out because everything is already done.
[10:53:01] *the full dry-run run
[10:53:18] the funny thing is that in our case, parallelization is not needed, it is the critical section of maintenance (stopped replication) that we would like to see faster
[10:53:45] which would happen by running the mysql library directly (connection and run is very cheap)
[10:54:59] let's see what Amir1 says, maybe he prefers 2 seconds of wait, given the execution overhead
[10:55:06] but that should be trivial to change
[10:55:12] yep
[10:55:51] Sorry, I got food poisoning yesterday
[10:55:56] Let me see
[10:56:03] I just found a small bug, I'll try to fix it, it's just in the phab message
[10:56:33] (says cookbooks.sre.switchdc.databases in both cases and not the actual cookbook)
[10:57:22] Amir1: if you are ok, let's discuss the status at some point today
[10:57:41] I think it is good, and we are just tuning timings
[11:01:25] I'm about to hop on a train for eight hours. But I will be able to look at stuff there (but no meeting 😕) does that work?
[11:01:50] well, what would it take for you to get confident with the cookbook?
[11:02:09] will you run it? who if not?
when?
[11:02:45] I am planning to run the changes later today
[11:02:55] (Once I've reached Berlin)
[11:03:05] so are you confident with the cookbook?
[11:03:16] I need to check
[11:03:22] Will do on the train
[11:03:26] that's what I am trying to help you with
[11:03:37] do you want me to hand you test-s4 so you can test there?
[11:04:31] I am here for anything you know, from explanation, to shadowing, to anything you need
[11:04:43] *anything you want
[11:06:09] hello?
[11:09:48] one-liner fix: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1073757
[11:15:17] Thanks. I will check
[11:16:34] Gonna try with test-s4
[11:17:14] If you give me the command line, so I make sure I don't mess up anything
[11:17:15] thanks, all yours
[11:17:18] yep
[11:17:34] I am currently writing to test-s4:test.test to simulate load
[11:17:50] let me know if you want me to stop that at any moment
[11:19:23] (Connection here is spotty)
[11:19:37] let me get you the command line
[11:21:11] you also need the local patch, or use jaime's checkout
[11:21:21] yeah, either log in as me on cumin1002
[11:21:43] or apply this patch:
[11:22:45] https://phabricator.wikimedia.org/P69252
[11:22:58] we should have that in a proper patch so it is easier to test
[11:23:34] test-cookbook -c 1073750 --no-sal-logging sre.switchdc.databases.prepare -t T374972 eqiad codfw
[11:23:39] T374972: Output test logs of production testing of the pre switchover tasks related to databases - https://phabricator.wikimedia.org/T374972
[11:23:42] and then I run this ^
[11:25:39] or you can run the currently deployed cookbook in dry mode (cookbook --dry-mode sre.switchdc.databases.prepare -t T374972 eqiad codfw), but that will not run on test-s4
[11:26:01] it asks for confirmation on every section
[11:26:10] Sounds good. I will try the dry one for now
[11:26:28] and if a step fails, it asks you to abort, retry or wait
[11:26:38] so one can do manual fixes
[11:27:12] I have switched test-s4 in and out of circular replication several times already, including trying to break it
[11:27:28] jynus, Amir1: I've sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1073762 for testing on test-s4 so we can use test-cookbook -c 1073762
[11:28:12] that has a mistake, volans, in that it assumes codfw is the primary datacenter
[11:28:31] that's from your paste :D
[11:28:37] which is one of the tests I have done, but I believe currently eqiad is the primary
[11:28:46] ack, let me invert
[11:28:59] yeah, well, it depends on which direction one wants to test :-D
[11:29:10] ok, let me do it properly then
[11:30:47] Amir1: the only open question is that it waited 10 seconds as instructed by manuel, but with some overhead and checks it goes up to 15-16 seconds. So we could reduce it further
[11:31:20] {done}
[11:31:32] I think 15s is still good
[11:31:46] Doing it manually makes it take way longer
[11:31:58] we could do 3 checks every 2 seconds which would put us at around 9-12 seconds
[11:32:13] No matter what we do, it'll shorten the time when run automatically
[11:32:14] I just wanted you to be aware :-D
[11:32:36] Sure
[11:32:43] I think we are good for now
[11:32:53] but I have battle-tested it against test-s4 quite significantly (we found some bugs and corrected them)
[11:33:04] the other difference with manual
[11:33:13] That's amazing. Thank you!
[11:33:15] is that it chooses the masters based on puppet, not orchestrator
[11:33:41] I will review the dry run against all sources before the run
[11:33:55] to make sure puppet and replication match
[11:34:15] once we have dbctl merged (after the switchover) we could get them easily via dbctl
[11:34:18] I prefer the orch API (it's open on cumin hosts) but that's personal preference 😁
[11:34:39] yeah, I discussed with volans that I would like to change it further after this iteration
[11:34:43] also this is the same method used by the actual switchdc cookbooks
[11:34:55] but on the other hand, that is battle-tested for the actual switchover
[11:35:20] so it seemed to be the wise option for now, iterate later on the next switch
[11:35:55] (there are many improvements to make, but at some point we had to be conservative)
[11:36:20] I think that is all from my side, but let me know if there is something else (questions, etc)
[11:36:33] and of course if you ask me, I will be around when you run it
[11:40:35] Amir1: one last thing, swfrench-wmf asked yesterday about someone responsible to coordinate precisely that maintenance
[11:41:07] I will tell him it is you, and the rest of the DBAs as a backup or whatever, if that is ok (or you can tell him :-D)
[11:46:11] sounds good to me
[11:46:29] I will do the work in his timezone anyway
[11:46:34] perfect
[11:57:17] forgot to say - despite checking masters from puppet, it checks the topology is right (they are masters and replicating in the right direction) so it should be safe in that regard
[14:37:35] * swfrench-wmf has now been told, albeit passively :)
[14:38:01] <3
[14:38:14] was making time for your timezone
[16:38:14] Amir1: T375050 over, you're good to go
[16:38:16] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[16:38:42] arnaudb: <3
[16:39:06] jynus: I just saw your email on my way out, let's have that chat tomorrow!
[16:39:20] good
[16:39:26] awesome
[16:39:26] I will start with s5 anyway
[16:39:32] ack
[16:39:53] signing off for the day, reachable on Signal if needed!
[16:40:39] hopefully not :D
[17:28:43] :D
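As a closing illustration of the topology check mentioned at [11:57:17] (the masters come from puppet, but the cookbook verifies they are really masters replicating in the right direction before touching anything), here is a minimal sketch of that kind of verification done with a direct client connection. The real cookbook goes through spicerack's mysql_legacy/remote execution instead, so the function name, the credentials path, and the eqiad host below are assumptions for illustration only:

```python
import pymysql

def assert_replicates_from(replica_host: str, expected_master: str) -> None:
    """Check that `replica_host` is an active replica of `expected_master`
    (i.e. the topology matches what puppet claims) before stopping anything.
    Credentials path and direct pymysql access are illustrative assumptions."""
    conn = pymysql.connect(
        host=replica_host,
        read_default_file='/root/.my.cnf',  # placeholder credentials source
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW SLAVE STATUS')
            status = cur.fetchone()
        if status is None:
            raise RuntimeError(f'{replica_host} is not replicating from anyone')
        if status['Master_Host'] != expected_master:
            raise RuntimeError(
                f"{replica_host} replicates from {status['Master_Host']}, expected {expected_master}"
            )
        if status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
            raise RuntimeError(f'replication on {replica_host} is not running')
    finally:
        conn.close()

# In a circular-replication setup both masters would be checked, each against the other, e.g.:
# assert_replicates_from('db2230.codfw.wmnet', 'db1234.eqiad.wmnet')  # db1234 is a hypothetical host
# assert_replicates_from('db1234.eqiad.wmnet', 'db2230.codfw.wmnet')
```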