[09:30:16] Amir1: arnaudb I am taking over db1125 & db2230
[09:30:33] ack, please do so (cc volans)
[09:30:37] the first is already broken, so I don't see how I could break it more
[09:30:44] :-)
[09:30:47] aha
[09:30:57] (but I will try)
[09:31:05] (everything is up and running)
[09:31:18] ah, ok
[09:31:39] I thought it was still bad because of T374774
[09:31:41] T374774: db1125 (test-s4) is broken - https://phabricator.wikimedia.org/T374774
[09:32:29] you can take over if you want, this is no trouble anyway ^^ the remaining thing to do would be to fix the SQL schema
[09:32:38] nah, no worries
[09:32:46] (I've lowered the priority in my todo bc it's not blocking anymore for testing stuff)
[09:32:47] I will just recover both from backups afterwards
[09:32:52] ack :)
[09:33:34] https://phabricator.wikimedia.org/T374774#10148652 cf this comment for the remaining issue
[09:34:00] it's ok, I will just reload its data from backup anyway
[09:34:15] as my intention is to break those 2 hosts as much as I can
[09:34:50] seems fun :D
[09:39:44] :D
[09:40:14] jynus: if you're ok with the spicerack changes in the meanwhile I can merge and release a new version
[09:42:24] yes please
[09:42:33] I was about to vote +1 but I lost the change
[09:43:15] https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1073274
[09:43:58] thx :)
[09:44:45] just to be extra redundant, I am going to write into test-s4, so data there won't be the same as s4
[09:45:28] ack
[09:45:49] ack
[10:10:49] volans: what's a good way to run a local copy of https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1059052 before deploying to production?
[10:11:26] test-cookbook?
[10:11:28] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging
[10:11:45] ah, interesting
[10:11:46] but you need to wait a sec for the spicerack release for the rename of the gtid method
[10:12:15] test-cookbook also allows for local modifications in the temporary checkout
[10:12:27] I see
[10:16:55] apparently orchestrator is not very intelligent and requires a heartbeat cleanup after undoing circular replication
[10:17:17] if we decide the new names for the cookbooks I can already update the patch with the final names
[10:17:21] that was already in the notes
[10:17:29] and if you want also the changes to test it on test-s4
[10:18:12] it's ok for now, I just want to run it very slowly, so I will just add some exits here and there
[10:19:25] to make sure it doesn't continue in case it goes to the wrong set of hosts
[10:19:54] I think it asks now before proceeding anyway
[10:20:54] yes it does
[10:21:20] but I am in paranoid mode, so I will not trust anything or anyone :-D
[10:23:48] so we can first run it in dry-run mode with the patch for test-s4, then for real for test-s4 and then without the test-s4 patch in dry-run mode for all the sections, that makes sense
[10:25:29] yep, yep, I am trying to see how test-cookbook works
[10:25:59] how do I just do a checkout?
[10:26:11] run the cookbook with -h/--help :D
[10:28:57] nope, that didn't do it
[10:29:22] I ran "test-cookbook -c 1059052 -h"
[10:30:03] you have to run something
[10:30:14] test-cookbook -c 1059052 sre.hosts.downtime -h
[10:30:15] for example
[10:30:27] doesn't have to be the cookbook you want to test ;)
[10:30:43] to fulfill the paranoia mode ;)
[10:31:01] or, y'know, pick the reimage cookbook and live dangerously ;)
[10:31:17] :D
[10:31:24] with a swift host then
[10:31:28] sure, reimage ms-be2019 is ok, Emperor?
[10:32:03] anyway we could add this use case and just check out if no cookbook is passed or add a specific flag for it
[10:32:38] it's ok, just please let me go slowly, this is something I will only have to go through once to understand your mind
[10:34:30] sooo what are the final names we want for those cookbooks?
[10:34:31] :D
[10:34:45] we will keep it as it is for now
[10:35:02] as you want, no prob to change it
[10:35:02] but the whole philosophy needs DBA feedback
[10:35:35] as now I understand Amir1 not liking it: this is not adapted well to our workflow
[10:35:56] not the testing, but the logic
[10:36:22] I followed manuel's doc
[10:36:35] yeah, no, that's ok
[10:37:49] it is just you are making us work differently than what we do, from language to structure
[10:38:21] and that's ok for a run, but I would like to see us more involved
[10:39:46] volans: can I give you an example that is not important, but is very visible?
[10:40:02] sure, I need more details/context to understand what you mean exactly
[10:40:09] others that are more important are not *that easy* to see
[10:40:51] in the documentation, manuel mentions this is the "enabling (of) circular replication"
[10:40:52] as for involvement, the spicerack library improvements were done with your team, the cookbooks were me following manuel's doc using what we have in spicerack
[10:41:11] you call it enable_cross_replication
[10:41:23] you may think, oh, that's like a non-issue
[10:41:31] and technically it is
[10:42:00] but it causes me huge mental pain, "you don't cross streams" :-D
[10:42:19] again, that is naming and that is not important
[10:42:33] but it is an example regarding philosophy
[10:42:58] we use cross in other contexts
[10:43:50] sure, let's rename it
[10:44:03] no need to touch it now
[10:44:10] as I said, it is a silly example
[10:44:21] of a larger philosophy
[10:45:12] I'd prefer to get this working as it is asap and then collaborate better in the future
[10:47:11] is run the entry point to the script?
[10:48:19] run() is the entry point of every cookbook, because the two cookbooks share the same general logic (run a bunch of stuff for all sections), that's handled in the __init__.py file in DatabaseRunnerBase's run()
[10:48:43] cool, just confirming
[10:48:44] while each cookbook implements its own run_on_section()
[10:49:38] sorry, I didn't get that last part
[10:50:26] if you look at DatabaseRunnerBase's run() at one point it calls self.run_on_section()
[10:50:46] that's implemented in prepare and finalize, as those cookbook classes inherit from DatabaseRunnerBase
[10:55:03] I get what you meant now, I just didn't understand it at first
[10:55:12] no prob
[10:56:23] check cumin1002:my_home/cookbooks_testing/cookbooks-1059052/cookbooks/sre/switchdc/databases$ git diff
[10:56:39] and I will run it in dry-run mode if it looks fine to you
[10:57:46] that would not work
[10:57:58] oh
[10:58:03] test-s4 is not part of the CORE_SECTIONS
[10:58:12] I see
[10:58:27] let me add it then
[10:59:04] see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1059052/9#message-05468de6f05f302f7eadfea66d9a5633302af344
[10:59:11] those are the 3-line changes to make it work for test-s4
[10:59:20] I can send them as a new PS if that's easier
[11:01:12] also let me finish the spicerack release, or the calls to gtid will fail due to the rename
[11:07:51] ok, spicerack released and deployed to the prod cumin hosts
[11:07:56] so that bit is done
[11:09:17] my intention is to run "sudo test-cookbook -c cookbooks --dry-run sre.switchdc.databases.prepare"
[11:09:40] ugh, I typed something wrong there
[11:10:02] sudo test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare
[11:10:15] assuming that will run my modified version
[11:10:52] * volans checking git diff
[11:11:05] it will detect local modifications and ask you
[11:11:13] I will be removing learning wheels later on
[11:11:19] I am very conservative atm
[11:11:26] + self.section = 'test-s4' is not needed
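The run() / run_on_section() split described above can be sketched as follows. This is a minimal, hypothetical illustration and not the actual cookbooks code: only the names DatabaseRunnerBase, run(), run_on_section() and CORE_SECTIONS come from the log, while the section values, constructor arguments and return values are assumptions made up for the example.

```python
from abc import ABC, abstractmethod

# Assumed placeholder values; the real CORE_SECTIONS is defined elsewhere.
CORE_SECTIONS = ("s1", "s2", "s4")


class DatabaseRunnerBase(ABC):
    """Simplified stand-in for the shared base class mentioned in the log."""

    def __init__(self, sections=CORE_SECTIONS, dry_run=True):
        self.sections = tuple(sections)
        self.dry_run = dry_run

    def run(self):
        """Entry point: apply the per-section logic to every section in order."""
        results = {}
        for section in self.sections:
            results[section] = self.run_on_section(section)
        return results

    @abstractmethod
    def run_on_section(self, section):
        """Per-cookbook logic, implemented by each subclass."""


class Prepare(DatabaseRunnerBase):
    def run_on_section(self, section):
        mode = "dry-run" if self.dry_run else "real"
        return f"prepare {section} ({mode})"


class Finalize(DatabaseRunnerBase):
    def run_on_section(self, section):
        mode = "dry-run" if self.dry_run else "real"
        return f"finalize {section} ({mode})"
```

In this sketch, restricting a run to the test section amounts to passing a different section list, for example Prepare(sections=["test-s4"]).run(), which is roughly what the local patch discussed in the log does by hand.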
[11:11:31] I know
[11:11:37] will remove it later
[11:11:45] but it's giving you potentially fake information
[11:11:47] :D
[11:12:13] ok to proceed?
[11:12:13] for section in ["test-s4"]: already ensures that and also it's just a name
[11:12:22] yes the rest of the diff looks ok
[11:12:29] let me do this beforehand
[11:12:53] jynus: FYI I have to leave in ~15 minutes for ~1h
[11:13:09] maybe less
[11:14:12] no worries, I can take it from here
[11:16:24] see, it gave me information
[11:16:26] ?
[11:16:32] cumin.backends.InvalidQueryError: Unable to find alias replacement for 'db-section-test-s4' in the configuration
[11:16:52] will handle it all myself now that I know how you think :-D
[11:17:37] where?
[11:17:46] mysql.get_dbs("db1125*") doesn't use aliases
[11:18:06] ahhh the hosts to downtime
[11:18:07] my bad
[11:18:14] not really
[11:18:19] downtime_hosts = self.remote.query(f"A:db-section-{section}").hosts
[11:18:20] it works as intended
[11:18:27] yeah that needs to be adjusted for test-s4
[11:18:30] go away
[11:18:39] :D
[11:18:40] I am not comfortable
[11:18:42] *now
[11:18:52] one last question, will this log spam?
[11:19:17] you mean SAL?
[11:19:19] yes
[11:19:33] not in dry-run of course, yes when running for real, unless you pass --no-sal-logging
[11:19:36] or the test removes it?
[11:19:48] good to know, thanks
[11:19:55] all the cookbook logs are in your home
[11:20:04] will use that as I will try to break the script :-D
[11:20:07] ~/cookbooks_testing/logs
[11:20:32] oh, so testing does not SAL by default?
[11:20:42] what do you mean?
[11:21:04] dry-run never SALs, as it doesn't do any RW operation
[11:21:08] if I do test-cookbook but not a dry run it will SAL, right?
[11:21:12] yes
[11:21:19] ok, that's all, thanks, understood
[11:21:26] if you do test-cookbook -c 12345 --no-sal-logging ....
[11:21:28] it will not
[11:21:33] thanks
[11:22:11] my intention now is basically to run it with all the possible failure scenarios I can think of
[11:22:17] that's all
[11:22:29] break stuff :D
[11:22:57] AND collect info for a possible post-switchover version
[15:03:14] hello data-persistence friends - I should know this, but who is the point of contact for running pre/post-switchover setup/teardown this time around? I know there's a lot of automation work going on, but I don't know who will be driving :)
[15:03:53] swfrench-wmf: that's a much more complicated question than you might think :D
[15:03:56] we're working on that....
[15:03:58] stay tuned :D
[15:04:18] sounds good, just keep me posted :)
[15:04:32] and let me know if there's anything I can do to help, of course
[15:04:54] "someone from data persistence"
[15:04:57] at the moment
[15:12:16] volans: I thought you saw my comment, I am using the task to vomit all my notes
[15:12:36] 🤢
[15:12:50] feel free to keep an eye, but no need to listen to everyone until I finish my checks
[15:13:11] so we focus on the most important ones given the time constraints
[15:14:02] sure sure, I just wanted to tell you why it behaves differently than you expected given the local patch
[15:14:12] yes, that part was nice
[16:00:37] * volans afk for ~20m
[16:00:47] ack
[16:00:58] topranks: is the maintenance still going on as expected?
[16:01:43] arnaudb: yes, and we'll start with db2209
[16:02:04] will be a few mins later to start as a previous upgrade is still ongoing but we're almost done
[16:02:13] ack, thanks for the heads up :)
[16:23:42] * volans back
[16:40:39] I believe we may be missing a heartbeat. qualifier or switch db command
[16:41:15] but why did the select work and the delete didn't?
[16:41:38] ah, the query is ok, but run_query is missing a parameter
[16:48:32] nice catch! updated CR
[16:48:59] I believe there may be another thing
[16:49:06] but I am not seeing it right now
[16:49:15] **Failed to enable GTID on db2230.codfw.wmnet, current value: Slave_Pos**
[16:49:29] however, Slave_pos is what we want
[16:49:36] something related to capitalization?
[16:49:54] I thought I had fixed that already, checking
[16:49:58] that doesn't happen with No/no?
[16:50:12] oh, maybe I am not using the latest version
[16:50:27] not today, was an earlier fix
[16:50:32] from a week ago or more
[16:51:06] do you remember where it is, I cannot find where that is handled. if you can give me a patch or a file/lineno
[16:52:15] maybe that was on spicerack
[16:52:24] no, it was for the disable gtid
[16:52:31] prepare line 131 "Using_Gtid": MasterUseGTID.NO.value.capitalize()
[16:52:52] the values in mariadb docs are lowercase but they are returned capitalized from a select
[16:53:14] this seems different I'm checking the logs
[16:53:20] yeah
[16:53:30] it is that, but you probably fixed it for No/no
[16:53:35] but slave_pos is a pain
[16:53:36] :-D
[16:53:43] because it is Slave_Pos as returned
[16:54:06] not a great decision, I would say, from mariadb
[16:54:46] maybe better to lowercase everything?
[16:54:48] I picked the values from https://mariadb.com/kb/en/change-master-to/#master_use_gtid
[16:54:56] yeah, I trust you
[16:55:00] should I use the capitalized/camelcase version instead?
[16:55:09] if the lowercase ones are never used...
[16:55:26] ofc "CHANGE MASTER TO MASTER_USE_GTID=slave_pos" works
[16:55:33] the problem is it returns it capitalized
[16:55:36] but then if you select it gives you Slave_Pos :/
[16:55:52] yeah let's compare lowercase I guess.
[16:56:10] Using_Gtid: Slave_Pos for show slave status
[16:56:28] especially as that is the typical thing they transparently change between versions :-D
[16:56:39] so nice of them
[16:56:54] welcome to the db team!
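The capitalization issue discussed above comes down to str.capitalize() upper-casing only the first character: it turns "no" into "No" but "slave_pos" into "Slave_pos", which does not match the "Slave_Pos" reported by the server. A minimal sketch of the "compare lowercase" approach follows. The MasterUseGTID enum here is a hypothetical stand-in with values assumed from the lowercase spellings in the MariaDB documentation, and gtid_mode_matches is an illustrative helper name, not a function from spicerack or the cookbooks.

```python
from enum import Enum


class MasterUseGTID(Enum):
    """Hypothetical stand-in; values assumed from the MariaDB docs spelling."""

    CURRENT_POS = "current_pos"
    SLAVE_POS = "slave_pos"
    NO = "no"


def gtid_mode_matches(expected, reported):
    """Compare the expected mode with the reported Using_Gtid value, ignoring case."""
    return reported.strip().lower() == expected.value.lower()


# capitalize() is enough for "no" but not for "slave_pos".
print(MasterUseGTID.NO.value.capitalize())         # No
print(MasterUseGTID.SLAVE_POS.value.capitalize())  # Slave_pos
print(gtid_mode_matches(MasterUseGTID.SLAVE_POS, "Slave_Pos"))  # True
```

Normalizing both sides also keeps the check stable if the server changes how it capitalizes the value between versions.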
[16:57:28] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1059052/10..11/cookbooks/sre/switchdc/databases/finalize.py should fix it
[16:57:40] other than those 2 things, I think we should be good to go
[16:58:31] I *think* I've addressed all the bugs/blockers, there are a few nice-to-have changes you said to wait on, but I'd be happy to fix them too if you want tomorrow morning. Up to you
[16:58:49] there is only 1 thing, which is the wait time
[16:59:30] I checked and it is waiting almost 30 seconds, that may be too much, and while it is following manuel's literal instructions, probably not what he intended in spirit
[17:00:02] as that is changing constants, it is not a big deal to tune
[17:00:16] which 30s?
[17:00:51] 10 seconds + 1 second + check + command execution end up being almost 20-something seconds of replication delay
[17:01:02] ahhh got it
[17:01:05] between stop and start, and that may be too much
[17:01:45] sure, we can tune it as you want, that's easy :)
[17:01:55] but that is not a logic rewrite, so shouldn't be too hard
[17:02:08] let me retry the latest changes and I'd say to merge afterwards
[17:02:19] but yeah, it is a bit late, maybe better tomorrow
[17:02:51] sure makes sense. If you have a suggestion for the wait time, how you want to tune it lmk
[17:03:04] if it's just reducing the 10s sleep or in another way
[17:03:15] yeah, probably that
[17:03:32] that's something I would like to have Amir's buy-in on
[17:06:05] sure, that takes a second to change, so we can do anytime. I'd say, if you have time let's do some final testing tomorrow with the latest PS (plus your local changes) and then we should be able to merge.
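The wait-time discussion above is simple addition: the time replication stays stopped is the fixed sleep plus the extra second plus however long the check and the commands take. A small sketch of that budget, assuming made-up timings for the check and command steps: only the 10 s and 1 s figures come from the log, the other two values are illustrative placeholders and not measurements.

```python
def replication_pause_seconds(sleep_s, extra_wait_s, check_s, command_s):
    """Approximate time replication stays stopped between stop and start."""
    return sleep_s + extra_wait_s + check_s + command_s


# 10 s and 1 s are from the log; check_s and command_s are assumed placeholders.
current = replication_pause_seconds(sleep_s=10, extra_wait_s=1, check_s=4, command_s=6)

# Reducing only the fixed sleep (one of the options mentioned) shrinks the total
# by the same amount, since the other terms do not depend on it.
tuned = replication_pause_seconds(sleep_s=3, extra_wait_s=1, check_s=4, command_s=6)

print(current, tuned, current - tuned)
```

With these placeholder values the sleep is the largest single term, which is consistent with the suggestion that reducing the 10 s sleep is probably the simplest knob to turn.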
[17:06:34] yes, I will leave it here, otherwise I won't be too fresh
[17:06:46] but I am happy with the state of things
[17:06:59] I thought review was going to take more time
[17:07:10] I've never tried but I think that the local changes could be retained with a stash + checkout (to remove local changes) + test-cookbook -c NNNNN sre.hosts.downtime -h (to refresh the local checkout with the latest PS) + stash pop
[17:07:36] thanks a lot this was super helpful!
[17:07:37] good point. although I saved the patch also just in case
[17:07:55] well, I didn't do it to make you happy, if anything I was bothering you :-D
[17:08:20] thanks for the help and responsiveness, it is appreciated
[17:09:28] my pleasure
[17:09:31] tty tomorrow then
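The untested stash + checkout + test-cookbook + stash pop idea from the log can be written down as an ordered command list. The sketch below is a hypothetical helper that only prints the commands unless execute=True is passed. The exact git invocations (in particular "git checkout -- ." for dropping local changes) are an assumption about what was meant, and NNNNN stays a placeholder for the Gerrit change number, as in the log.

```python
import subprocess


def refresh_checkout_commands(change_id):
    """Return the command sequence suggested in the log, in order."""
    return [
        ["git", "stash"],                 # set local changes aside
        ["git", "checkout", "--", "."],   # assumed form of "checkout (to remove local changes)"
        ["test-cookbook", "-c", change_id, "sre.hosts.downtime", "-h"],  # refresh to latest PS
        ["git", "stash", "pop"],          # re-apply the local changes
    ]


def refresh_checkout(change_id, checkout_dir, execute=False):
    """Print the commands, or run them inside checkout_dir when execute is True."""
    commands = refresh_checkout_commands(change_id)
    for argv in commands:
        if execute:
            subprocess.run(argv, cwd=checkout_dir, check=True)
        else:
            print("would run:", " ".join(argv))
    return commands


refresh_checkout("NNNNN", checkout_dir=".")
```

Keeping a saved copy of the patch, as mentioned in the log, remains a sensible fallback in case the stash pop conflicts with the refreshed patch set.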