[08:55:24] Amir1: Is MW ready to have ms3 as a section in dbctl even if not used?
[11:19:59] mildly concerned to have another report https://phabricator.wikimedia.org/T387340#10589479 of MW losing a race with itself when writing an object, resulting in MW thinking the object is gone when in fact it is present on swift.
[11:20:26] Not sure if this is a user trying some new workflow that is exposing an old race, or if something has changed in MW that is now losing this race more often
[11:26:36] PROBLEM - MariaDB sustained replica lag on x1 on db1220 is CRITICAL: 12.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1220&var-port=9104
[11:28:36] PROBLEM - MariaDB sustained replica lag on x1 on db1220 is CRITICAL: 10.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1220&var-port=9104
[11:31:36] RECOVERY - MariaDB sustained replica lag on x1 on db1220 is OK: (C)10 ge (W)5 ge 1 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1220&var-port=9104
[11:52:14] Interesting observation that our older Config-J systems have special firmware on the controllers - https://phabricator.wikimedia.org/T384003#10588536
[12:15:52] Amir1: I am ready to start working on ms2, but that means that x2 will stop having any standby replica
[12:15:58] as ms3 is done
[12:16:07] So if there's an x2 host failure there is no host to fail over to
[12:16:25] if that happens, maybe we can force push the changes? :D
[12:16:40] XD
[12:16:45] maybe I should do it Monday
[12:16:46] just in case
[12:17:07] yeah, let's do it on Monday, we can't use it until Monday anyway
[12:55:54] Emperor: does that potentially mean that the other controller we want to try may not solve the problem?
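The "MW losing a race with itself" report above is a classic check-after-write race against an eventually visible object store: the object has been written, but a status check runs before the write is visible and wrongly concludes the object is gone. A minimal, deterministic sketch of the failure mode (toy store, hypothetical names; not Swift's or MediaWiki's actual API):

```python
import time

class EventuallyConsistentStore:
    """Toy object store where writes only become visible after a delay,
    so a HEAD that races the PUT reports the object as missing even
    though it is safely stored. Illustrative only."""

    def __init__(self, visibility_delay=0.05):
        self._objects = {}  # name -> (payload, time the write becomes visible)
        self._delay = visibility_delay

    def put(self, name, payload):
        # The write is durable immediately, but not yet visible to readers.
        self._objects[name] = (payload, time.monotonic() + self._delay)

    def head(self, name):
        entry = self._objects.get(name)
        if entry is None:
            return False
        _, visible_at = entry
        # A check before visible_at looks exactly like "object is gone".
        return time.monotonic() >= visible_at

store = EventuallyConsistentStore()
store.put("thumb/example.jpg", b"...")
racing_check = store.head("thumb/example.jpg")   # races the write
time.sleep(0.06)
later_check = store.head("thumb/example.jpg")    # write now visible
print(racing_check, later_check)  # False True
```

This is why the report distinguishes "object is gone" (as MW concludes) from "object is present on swift" (as a later check shows): both checks are correct at the instant they run.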
[12:56:32] this extra information does not provide a lot of encouragement...
[13:02:09] kwakuofori: no idea, sorry :(
[13:02:51] e.lukey is looking to get to the point of trying a controller reset (but they need a newer storcli to attempt that), which may or may not help
[13:04:12] hmm... ok, I guess we just wait and see
[13:06:20] I think so, yes.
[13:06:58] A quote for the controller is in the works, at least
[16:05:55] @elukey I left a question for you on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1084813 - should I flag the comment as "resolved" for you to see it?
[16:07:25] federico3: o/ nono, I get an email if you answer; usually we leave it to the person who originally created the comment to resolve it (unless it is something trivial like a "typo" etc.)
[16:18:06] federico3: tried to reply; for your confirm_on_failure + retry question I don't have a solid answer, I'll try to read up on it
[16:18:33] in any case, I'll try to help speed up the code patches (whatever I can; for the more complicated things we'll need to defer to Riccardo)
[16:19:25] elukey: I'm doing really minor cleanups - the bulk of the code is there
[16:20:25] (I'm also e2e testing the script by running a real clone process)
[16:21:10] federico3: I am aware that you are picking up previous work, but I wouldn't call it minor cleanup; some code is being added and, for better or worse, you are the owner of it now :D
[16:21:50] and I am totally aware that this is tested etc., but I am worried about other folks using it without your context and ending up with terse stacktraces to debug
[16:22:13] it is also fine to leave things as they are, with minor error message summaries etc.
[16:22:16] nothing really major
[16:22:21] and we refine as we go
[16:23:39] I was referring to the last few commits triggering the linters; I just have to fix the docstrings and I don't expect to push any substantial changes to this CR, so it can be merged
[16:24:23] ah right okok
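The confirm_on_failure + retry question discussed above is about how a cookbook should handle transient failures: retry automatically a few times, and only then involve a human. A plain-Python sketch of the retry half of that pattern (generic decorator with hypothetical names; this is not the actual spicerack/wmflib implementation):

```python
import functools
import time

def retry(tries=3, delay=0.01, exceptions=(RuntimeError,)):
    """Retry the wrapped function up to `tries` times on the listed
    exceptions, sleeping `delay` seconds between attempts, and re-raise
    the last exception if every attempt fails. Illustrative sketch only."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    last_exc = exc
                    if attempt < tries:
                        time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator

calls = []

@retry(tries=3, delay=0)
def flaky_step():
    """Fails twice, then succeeds, like a transient network hiccup."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky_step(), len(calls))  # ok 3
```

The open question in the chat is how such a retry loop composes with an interactive confirm-on-failure prompt: whether the prompt should fire after each failed attempt or only once the retries are exhausted, which is exactly the design point deferred to documentation and review.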