[08:57:50] Emperor: o/ we are testing the machinetranslation credentials, but s3cmd keeps returning errors (signature doesn't match, etc.). I recall that when I created the ml read-only account, I had to run something like https://phabricator.wikimedia.org/T311628#8050691
[08:58:04] just to confirm, was it done for the machinetranslation account?
[09:15:04] I doubt it (I tend to only do the account creation and leave everything after that to the user)
[09:15:26] elukey: ^---
[09:18:17] Emperor: oook! Mind if I do it?
[09:25:46] please do, bonus points for !logging it so the phab item (and future-us) knows :)
[09:27:45] definitely
[10:13:42] Emperor: sorry, getting back to this: there is something that may not be right for the machinetranslation account. In puppet I see machinetranslation:prod, while in private I see machinetranslation
[10:14:17] is it right like this or should it be machinetranslation_prod in private?
[11:39:12] Amir1: Manuel is ooto - can I go ahead with the DC master flips in codfw for the schema change?
[11:39:30] which section?
[11:40:28] the 3 remaining sections https://phabricator.wikimedia.org/T391056 aka s1, s4, s8 - I would start with s8 if that works
[11:42:08] make sure the candidate master is rebooted so we don't need to switch over once again
[11:42:19] otherwise it's good
[11:42:54] you mean doing the security update after the flip?
[11:43:10] ah you mean checking the current DC master before the flip, sure
[11:44:15] speaking of which, if you have some time for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129904 it would also be good
[11:44:18] yeah, have you done s8 in codfw?
[11:44:57] all the replica hosts are done but I'll double-check each before the flips
[11:45:03] that should wait IMHO. Switchovers are the most sensitive operation we do; any mistake can irreversibly corrupt the data
[11:48:48] the cookbook is not changing the logic around the process but adding safety checks and removing manual copy-paste steps. It's pretty much generating the same steps as the manual process.
[11:49:08] I've commented, and it looks like a really bad idea
[11:49:35] federico3: still, bugs can sneak in
[11:49:58] some things should be amended in db-switchover and tested thoroughly, some things shouldn't be in code at all
[11:50:01] for anything else it would have been much easier, but this is different
[11:50:31] Amir1: sure. We can find other ways to make the process as robust as possible
[11:51:09] I also agree with Jaime, a lot of those should first be moved to the code, automated in different ways or removed. Then, once the steps are simplified, we can do the cookbooking
[11:52:04] most of those should be in the switchover script, others deployed through puppet
[11:52:22] hardcoding data is a really bad idea, it should be deployed through puppet
[11:53:50] ok, then maybe we can schedule a meeting to plan the next steps to make the process as safe as possible, if we want
[11:55:38] has Manuel told you to focus on this cookbook?
[11:55:52] or were there other priorities?
[12:03:02] Manuel asked for other people to join the CR recently as the topic of doing DC master flips came up
[12:03:46] please restore the change
[12:07:39] uh? ok, do we want to rework this CR? I've been advised to keep CRs short and to close/open new ones for clarity
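For reference, a minimal sketch of the kind of credential check behind the 08:57 s3cmd errors; the endpoint and key variables are placeholders, not the actual machinetranslation configuration:

  # Placeholder endpoint and keys; substitute the real Swift/S3 endpoint and
  # the machinetranslation access/secret key pair.
  # A SignatureDoesNotMatch error on a simple listing usually means the secret
  # key on the client does not match what the cluster has for that account.
  s3cmd --access_key="$ACCESS_KEY" \
        --secret_key="$SECRET_KEY" \
        --host="$S3_ENDPOINT" \
        --host-bucket="$S3_ENDPOINT" \
        ls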
[12:08:06] but I'm ok both ways
[12:08:44] restbase is alerting on low storage https://alerts.wikimedia.org/?q=alertname%21%3DSystemdUnitFailed&q=team%3D~data-persistence&q=%40state%3Dactive
[12:09:06] summary: Disk space restbase2024:9100:/srv/sda4 5.833% free
[12:10:05] federico3: I think there could be value there, even if some stuff has to go somewhere else
[12:10:25] e.g. the .sql file shouldn't be embedded nor redundant
[12:11:08] but I think you should focus on the important stuff first for now
[12:11:39] (e.g. the kernel updates, is my understanding)
[12:12:43] we can continue talking about the patch at another time
[12:13:34] next FY starts soon, I assume a lot of refreshes are coming in, so it would be more important to make sure the automation works, etc.
[12:14:20] a -1 applies to a particular version, but it can be amended :-)
[12:15:26] jynus: just to clarify: I did not mean to imply that we need to discuss the next steps today or in a short time, as this is a long-term task, but it seems it will require some deeper design discussion
[12:16:13] I agree, but the patch can serve for that :-)
[13:29:16] jynus et al: Amir might be busy, do you have a sec for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163770 - it's just a "templated" CR for the DC master flip and should take only a second
[13:29:49] sorry, I was in a meeting
[13:30:17] I'm not here
[13:30:50] the maint bot has already created the patch automatically, it's in the ticket: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160101
[13:32:23] Amir1: thanks
[13:33:15] Amir1: set you as reviewer on the latter
[13:34:11] the automated patch is good but shouldn't be merged until puppet is disabled on the masters, see the checklist
[13:36:48] Amir1: I'm at the "FIXME" line :D
[13:37:43] then merge and move forward
[13:39:49] without review?
[13:40:01] this one is fine
[13:40:06] ok
[13:40:42] it got created automatically so the chance of a typo is 0
[14:08:47] I'm back
[14:10:32] federico3: please depool the host
[14:10:45] I know the weight is zero, but before the work you should depool it
[14:10:54] (also make sure to downtime it)
[14:18:27] Amir1: which host and which step specifically? If you are referring to the work for the auto schema change or the kernel update, the rolling restart is doing the depooling/repooling for db2161
[14:19:02] https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-ik7b2wpvsp4ikf2/
[14:19:18] I did not restart the host myself... is it a bug in the rolling restart?
[14:19:30] it was not depooled and was being restarted
[14:20:53] the rolling restart script does the depooling by itself: https://phabricator.wikimedia.org/P78687
[14:22:46] the rolling restart script is also doing the downtiming
[14:23:28] then I don't get why it was pooled while it was down
[14:24:24] I just ran sudo python3 rolling_restart.py --run as usual
[14:24:44] I can update the paste with the whole output
[14:26:25] nah, I think there was a mismatch in the time
[14:26:43] https://phabricator.wikimedia.org/P78688
[14:27:33] yet now the repooling seems odd
[15:07:50] Amir1: uhm, did you delete db2161's ipaddr from dbctl?
[15:08:19] I didn't, I just depooled it
[15:11:33] uf, something very wrong is going on
[15:11:48] let me roll back
[15:11:50] ?
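A rough sketch of the manual depool and downtime steps requested at 14:10; the host name, duration and reason are examples, and the downtime cookbook invocation is an assumption:

  # Depool the replica and commit the dbctl change.
  sudo dbctl instance db2161 depool
  sudo dbctl config commit -m "Depool db2161 before maintenance"
  # Downtime the host in alerting (duration and reason are illustrative).
  sudo cookbook sre.hosts.downtime --hours 4 -r "db2161 maintenance" 'db2161*'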
[15:11:51] there is a db2161 "10.192.16.216"
[15:11:56] in eqiad.json
[15:11:56] it's not in the right DC
[15:12:14] it's not pooled in, but I'm removing the entry
[15:12:33] done
[15:13:10] indeed the entry in codfw has the right ipaddr but it's pooled in with weight 0
[15:13:30] should I depool it, set the weight and then pool it in slowly, yes?
[15:13:45] I can still see it at https://noc.wikimedia.org/dbconfig/eqiad.json
[15:13:56] maybe propagation time?
[15:14:00] now it is gone
[15:14:23] looking good now
[15:35:11] now if I change the weight while depooled, dbctl config diff shows nothing
[15:35:29] I guess because it's depooled, so the weight change would not affect wm?
[15:37:29] Amir1: before pooling in I'm going to run ./drop_afl_patrolled_by_from_abuse_filter_T391056.py - good to go?
[15:37:53] sure!
[15:42:08] uhm nope... maybe the script expects the host to be pooled in?
[15:42:18] no, it works regardless
[15:42:27] I explicitly made it like that
[15:42:32] it wouldn't repool it though
[15:42:38] it either did not find the host or it's already updated... looking...
[15:46:41] federico3: you're depooling db1167 in eqiad. I think it's starting from scratch
[15:46:50] then it'll just depool and repool all of s8
[15:46:55] yes I see it
[15:47:11] only the ones that need the change tho
[15:47:39] db1167 was a leftover I suppose?
[15:48:49] nope, it depools and repools everything. That was the feature we asked to be implemented. You worked on it but it's not implemented yet
[15:49:13] T299441
[15:49:15] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441
[15:49:17] ah that one, meh
[15:50:04] but the check scan must have found at least one host needing the schema change
[15:50:46] anyhow, we need to do the change anyway to close the task
[15:53:23] now we have to wait for a day since each repool takes 45 minutes :D
[15:53:39] meh, it's automated
[15:53:41] Amir1: between schema changes and kernel upgrades and the fact that in the meantime hosts might be cloned, powered off for maintenance etc., we need some better scheduling
[15:53:54] Amir1: uh?
[15:54:44] ?
[15:55:15] meh, it's automated <-- this
[15:55:37] the script is in a screen
[15:55:55] it'll take a day to finish and it'll be mostly useless, but it doesn't need human intervention
[15:56:14] Amir1: but it runs after the check, no?
[15:56:20] usually it's better to run it with --check, get the list of replicas that still need the change and set the replicas variable
[15:56:38] federico3: no it doesn't. That's the whole point of T299441
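One possible shape of the "set the weight and then pool it in slowly" step from 15:13; the weight edit and the percentage steps are assumptions about how it would be done with dbctl:

  # Fix the weight on the instance record (opens it in an editor).
  sudo dbctl instance db2161 edit
  # Repool gradually; the percentages and the pauses between them are
  # illustrative only.
  sudo dbctl instance db2161 pool -p 10
  sudo dbctl config commit -m "Slowly repool db2161"
  # ...then repeat with -p 25, 50, 75, 100, leaving time in between...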
[15:56:38] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441
[15:57:03] it depools, runs the check, sees it's not needed and then repools, but repools take 45 minutes
[15:57:13] ah, but then we have a third case I wasn't aware of
[15:57:14] multiply that by the 30 replicas of s8
[15:57:24] I can stop and do a repool manually
[15:57:31] yes please
[15:57:35] that's much faster
[15:58:03] but that does not solve the issue that for *one* host needing the schema change we would have to run the full depool cycle on all the others right now
[15:58:19] ok, I'm repooling by hand and running --check again on both eqiad and codfw
[15:58:20] As I said above
[15:58:26] > it's better to run it with --check, get the list of replicas that still need the change and set the replicas variable
[15:58:40] there is a replicas variable in the schema change file, it's set to None
[15:59:17] so you can set the variable like replicas = ['db1234'] and then just run it again
[16:00:04] it'd be easy to implement a replicas argument
[16:00:20] exactly the same as what we did with section
[16:00:21] I remember you showed me that but I'm not following the full logic
[16:01:52] aka we discussed changing the logic in the generator by passing "check" etc., but instead, if we were to have the script first run a check as with --check, then populate its own replicas variable, then run the change, would that be equivalent?
[16:03:53] it's different from doing the check before depool and repool
[16:04:08] when you run the script with --check, it just checks and reports
[16:04:14] it's like a dry run
[16:04:28] you can use that to inform which hosts you need to run the schema change on
[16:04:42] then you override the variable in the second run (without --check)
[16:05:07] so it wouldn't depool hosts that you haven't specified
[16:05:26] I'm aware, my point is
[16:05:39] the difference in the logic is that if the check gets run before the depool, it's still run on everything
[16:05:48] this one doesn't even try
[16:07:26] if the script were tweaked to first run the same as --check, then use the output to populate its own "replicas" variable internally, and then call the same function as --run ...
[16:08:02] ...it would effectively depool only a subset of hosts, as needed?
[16:08:35] yes, but it'd be much better to just do the check before the depool; that makes the logic more encapsulated
[16:08:37] (and I'm aware it would walk through the generator twice, first in the check and then in the run)
[16:09:02] you're basically describing the plan function
[16:09:10] which is practically the same thing
[16:14:08] well, the planning function is not triggering the run. My point is that if we know and *trust* the combination of --check followed by --run, we have a low-hanging fruit
[16:15:53] (my thought was to implement the "proper" "just in time" check after testing the planning a bit)
[16:16:21] I think that's for later, not right now :)
[16:16:47] the "proper" implementation? yes
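Putting the --check workflow described above into concrete steps; the script name and the replicas variable come from the chat, while the example host name is hypothetical:

  # 1) Dry run: report which replicas still need the ALTER, without depooling.
  sudo python3 drop_afl_patrolled_by_from_abuse_filter_T391056.py --check
  # 2) In the schema-change file, override the replicas variable (it defaults
  #    to None, i.e. all replicas in the section) with the hosts reported
  #    above, e.g. replicas = ['db1234']  # hypothetical host
  # 3) Run it again without --check: only the listed replicas get depooled,
  #    altered and repooled.
  sudo python3 drop_afl_patrolled_by_from_abuse_filter_T391056.py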