[06:56:04] jynus: I am merging your change about tmps and ramfs
[06:57:41] ok
[07:13:11] I am going to switch x1 master
[07:25:52] mmm dbctl is broken?
[07:26:01] 'x1' does not match any of the regexes: 'DEFAULT', '^es[67]$', '^s[124-8]$'
[07:26:06] what?
[07:26:53] We've not made any changes to that as far as I know
[07:27:58] I wonder if this could have broken that? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153723/2/modules/profile/files/conftool/json-schema/mediawiki-config/dbconfig.schema
[07:28:02] _joe_: Are you around?
[07:28:26] <_joe_> yes
[07:28:49] Can you think if the above patch could make x1 not be detected by dbctl?
[07:29:07] Cause I don't see how it could happen, but on the other hand I've not seen any changes to dbctl
[07:29:13] <_joe_> not really but I lack a lot of context
[07:29:35] <_joe_> that patch looks wrong in any case
[07:29:39] marostegui: the previous regexes in readOnlyBySection wouldn't match x1 either
[07:30:12] This change did go thru https://phabricator.wikimedia.org/P78760
[07:30:17] and as joe says, the commit doesn't do what the subject suggests it should
[07:30:19] So it looks like it is failing to set RO for x1
[07:30:24] <_joe_> marostegui: what command did you launch?
[07:30:46] <_joe_> marostegui: are you sure it was possible to set x1 to ro before?
[07:30:47] Sorry guys, the patch is just the last change made to dbctl, not related to this switchover; it fails when I try to set x1 to RO
[07:30:49] <_joe_> via dbctl?
[07:30:57] Yes
[07:31:12] <_joe_> ok, I'm 99% sure the problem is that patch
[07:31:23] <_joe_> and / or the one just before
[07:31:47] You are making me doubt now about x1 RO
[07:32:34] Yeah, it was possible, just checked previous tasks
[07:32:36] x1 could be set as read only, as I once left it read only by accident
[07:32:49] yeah
[07:33:03] through dbctl, I cannot say, as it was before that existed
[07:34:19] It was possible yes
[07:34:50] <_joe_> I mean through dbctl
[07:35:00] <_joe_> I doubt it was ever possible tbh
[07:35:01] Yes
[07:35:05] <_joe_> given the code I see
[07:35:29] <_joe_> what I mean is adding it to readOnlyBySection shouldn't be possible
[07:35:55] <_joe_> marostegui: can you please clarify exactly what dbctl command you used and what didn't work in a paste?
[07:36:29] sudo dbctl --scope eqiad section x1 ro "Maintenance until 06:15 UTC - T397612"
[07:36:29] sudo dbctl --scope codfw section x1 ro "Maintenance until 06:15 UTC - T397612"
[07:36:29] sudo dbctl config commit -m "Set x1 eqiad as read-only for maintenance - T397612"
[07:36:29] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:36:49] <_joe_> yeah I don't think that could ever work in dbctl with the current schema
[07:37:05] https://phabricator.wikimedia.org/P78761
[07:37:49] I am sure it was working
[07:37:55] But I am doubting now
[07:38:17] <_joe_> you can easily verify if that patch was the issue
[07:39:26] <_joe_> change /etc/conftool/json-schema/mediawiki-config/dbconfig.schema to the old regex
[07:39:31] <_joe_> it should still reject your change
[07:39:45] I mean I can revert
[07:39:47] <_joe_> but if it doesn't, then we have another mystery on our hands
[07:40:36] I don't see how that change would make x1 fail
[07:40:40] That's why I am puzzled
[07:42:51] <_joe_> I don't think it ever worked tbh
[07:43:03] For now I've run sudo dbctl --scope eqiad section x1 rw and sudo dbctl --scope codfw section x1 rw
[07:43:04] <_joe_> I'm not even sure mw has support for x1 read-only that way
[07:43:10] To "unlock" the breakage
[07:44:23] looking back at the history of dbconfig.schema I'm not sure it's obviously ever matched x* in readOnlyBySection
[07:44:39] [the first version has DEFAULT and ^s[1-8]$]
[07:44:44] <_joe_> Emperor: you're saying it was never enabled in your opinion?
[07:44:48] <_joe_> I agree.
[07:45:20] that's my understanding, BICBW!
[07:45:29] <_joe_> but I'd suggest marostegui wait for Amir1 and figure this out between DBAs - again I'm not sure x1 read only can be set via readOnlyBySection
[07:46:02] wilco
[07:46:03] thanks
[07:46:15] <_joe_> yeah it's a pity you're the only DBA here atm
[07:53:18] I am pretty sure x1 switchovers were not this messy
[08:44:25] \o
[08:44:59] I am about to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1164235 which changes the user used by the MT services on k8s to access Thanos-Swift
[08:50:49] Emperor: ^
[08:51:44] I will switch over m2 master
[08:59:24] I'm about to go to a meeting, we can add x1 explicitly but AFAIK, MediaWiki is not well suited to handle it
[08:59:52] it ignores ro status in lb factory conf for external clusters (x1 is considered an external cluster)
[09:00:15] so even if it used to work in dbctl, it would have been a noop in mw
[09:01:51] Amir1: But I don't recall having dbctl complain about that with x1
[09:02:03] Also I don't recall creating the switchover patches with something that doesn't work in x1
[09:02:12] I am so confused now
[09:02:23] klausman: ack, thanks for letting me know.
[09:02:53] there are three things here, let me grab a patch
[09:04:24] okay my meeting got canceled
[09:04:27] klausman: NB - the two buckets in the AUTH_mlserve account (wmf-ml-models and wmf-ml-models+segments) still only have a read ACL to mlserve:ro
[09:04:36] the first thing is this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153723/2/modules/profile/files/conftool/json-schema/mediawiki-config/dbconfig.schema
[09:04:48] maybe that got changed but probably it used to allow x1
[09:05:16] the thing is that even if it allowed x1, because the flavour is "external", it would have been ignored by dbctl
[09:05:35] i.e. it never worked, maybe it let you do it but it never worked from what I'm seeing
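As a footnote on the morning's breakage: the rejection comes down to a plain regex check. The sketch below is not dbctl's own validation code (that lives behind the JSON schema patch linked above); it only shows why 'x1' fails the quoted patterns, and why adding a pattern such as '^x1$' would make the section name validate, which is presumably what the one-line fix later in the day does.

```python
# Minimal sketch, not dbctl itself: reproduce the schema rejection quoted above.
# JSON Schema patterns are not implicitly anchored, hence the explicit ^...$
# anchors in the schema's own regexes.
import re

patterns = ['DEFAULT', '^es[67]$', '^s[124-8]$']   # regexes from the dbctl error
print(any(re.search(p, 'x1') for p in patterns))               # False -> 'x1' rejected
print(any(re.search(p, 'x1') for p in patterns + ['^x1$']))    # True once x1 is allowed
```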
[09:05:43] klausman: I think you may want to adjust those ACLs to also allow access from the machinetranslation:ro user once you've created it (and rolling-restarted the proxies) before you start actually trying to use it
[09:05:44] Emperor: yes, I think Luca wanted to wait with that until these patches are all merged and puppet has distributed them, then do the swift post -r bit
[09:06:00] klausman: OK, cool, that sounds sensible, just wanted to check it was on the plan :)
[09:06:10] Plus of course the proxy restart (I'll do that in a bit, and let everyone here know)
[09:06:28] https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/80/diffs
[09:08:24] On top of that, compare https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/lbfactory/LBFactoryMulti.php#L188 and https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/lbfactory/LBFactoryMulti.php#L236: MediaWiki never reads the "RO by section" config for external clusters
[09:13:49] Amir1: I will be back in a sec, I am switching m2 master
[09:13:59] have fun!
[09:18:17] Amir1: So we have two options (if we cannot make x1 RO via MW): either change the switchover template to remove those two commands, or make dbctl at least support x1 (just add it to the regex) so dbctl isn't broken, even if it is ignored by MW
[09:18:36] I prefer option #2, so all the templates are the same and the switchovers are executed the same way
[09:18:38] What do you think?
[09:18:47] sounds good to me
[09:35:02] btw https://phabricator.wikimedia.org/T392784#10901217
[09:41:12] Amir1: https://gerrit.wikimedia.org/r/1166776
[09:41:16] I think that'd be all?
[09:42:43] yes but why would I +1 when I can use it as leverage to get my stuff reviewed too? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155210 or https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163846
[09:43:47] hahaha 1 line vs those!
[09:43:50] I will review them ok
[09:45:28] <3
[09:52:20] Confirmed that my patch fixed the issue found in the morning
[10:02:29] marostegui: I have little context around https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155210, want me to review it as well?
[10:06:24] I'll run the roll-restart cookbook for the Swift-Thanos proxies in a few mins unless someone stops me
[10:08:53] klausman: please go ahead
[10:09:27] Aye, cap'n
[10:09:59] ooh, I get promoted XD
[10:13:25] cookbook run complete, no errors
[10:17:25] FIRING: SystemdUnitFailed: swift-account-stats_machinetranslation:prod.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:54] o.O
[10:19:01] Hm, this is going to be a side-effect of deleting the old account.
[10:24:46] yeah, plus something is still amiss.
[10:25:20] The two layers of names really tie my brain in a knot
[10:26:12] yeah, we're rather over-stretching an auth system that isn't intended for production deployments
[10:27:25] RESOLVED: SystemdUnitFailed: swift-account-stats_machinetranslation:prod.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:32] I've removed the machinetranslation:prod service & timer, so that alert should clear soon.
[10:27:36] Ah, as it did while I was typing
[10:28:32] I think I found the missing bit (missed an _ro suffix in the private repo). I'll have to re-bounce the Swift FEs once more
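For readers without Swift context, the ACL adjustment being discussed (the "swift post -r bit") amounts to adding the new read-only user to each bucket's read ACL. A hypothetical sketch with python-swiftclient follows; the auth endpoint and credential are placeholders, while the container names and ACL entries are the ones named above.

```python
# Hypothetical sketch only: the endpoint and credentials below are placeholders,
# not the real Thanos-Swift configuration. It mirrors what `swift post -r` would do.
from swiftclient.client import Connection

conn = Connection(
    authurl='https://thanos-swift.example.org/auth/v1.0',  # placeholder auth endpoint
    user='mlserve:admin',                                   # placeholder credential
    key='REDACTED',
)

# Grant read access to both read-only users on the two buckets named above.
read_acl = 'mlserve:ro,machinetranslation:ro'
for container in ('wmf-ml-models', 'wmf-ml-models+segments'):
    conn.post_container(container, headers={'X-Container-Read': read_acl})
```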
[10:43:54] federico3: that's a change from Amir :)
[10:52:59] what did I break this time
[10:53:40] nothing, the alert above was not part of this conversation thread
[10:56:18] federico3: we are centralizing the information about tables into one giant yaml called the table catalog. That has a field for each table called "visibility" and basically using that, we can produce the list of private tables (or fully public tables https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163846)
[11:38:51] FYI, with this 7000 tables will be dropped from s3
[11:38:52] https://phabricator.wikimedia.org/T395928#10978560
[11:44:05] niiiice
[12:22:38] federico3: Can you try to finish the RO external store host reboots this week?
[12:24:04] marostegui: yes, and we should also plan the remaining flips a bit
[12:24:21] sure, do you have tasks for them so we can organize?
[12:24:49] other than the maint task itself?
[12:25:40] (BTW I'm updating the host list in the task and db1237 is not being detected, but if we add it later it will show up again)
[12:27:11] yeah, I actually rebooted it for a kernel upgrade but it never came back: https://phabricator.wikimedia.org/T398794
[13:06:46] Could I get a +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166822 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166823 please? They take the now-drained thanos backends out of the rings, and then out of hiera so I can decommission them.
[13:08:35] Emperor: I can have a look
[13:13:55] (and done.)
[13:20:27] thanks :)
[15:08:11] I have to go to the doctor, after that I'll run the clean up of private tables
[16:22:47] marostegui: cdanis: I think we akk 3 are saying the same, the qk is the wrong place to "fix" this
[16:22:51] *all
[16:22:54] yes
[16:22:58] I didn't mean that
[16:23:12] but I was confused because when we went back to healthy code
[16:23:24] usually the qk takes care of leftovers quickly
[16:23:34] and that is the part that worries me, it used to do that
[16:23:47] it won't protect against another overload - that's too low on the stack ofc
[16:24:31] sorry that I fixate so much on dbs but that's the only part I know, but I wrote other questions for other layers too
[16:25:29] So the qk for me is only interesting to increase the degraded state, not the first minutes 0:-)
[16:25:40] *to be responsible for
[16:26:05] the other parts are more important, but I can only check this part
[16:26:40] (I did some tests replicating it and the qk worked well, and I didn't see extreme overload at the end)
[16:37:34] apologies if I framed the query killer as being the main source of grief during that, I was a little frazzled
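To make the query-killer (qk) discussion above a bit more concrete, here is a heavily simplified sketch of what "taking care of leftovers" means: reaping long-running queries from the processlist once the bad code is gone. It is not the production query killer; the threshold, connection details and kill policy are all assumptions.

```python
# Illustrative sketch only: this is NOT the production query killer, whose logic,
# thresholds and credentials are unknown here. It just shows the general idea
# discussed above: once things are healthy again, leftover long-running queries
# can be reaped from the processlist.
import pymysql

MAX_QUERY_SECONDS = 60  # hypothetical threshold

conn = pymysql.connect(host='localhost', user='root', password='',
                       database='information_schema')
with conn.cursor() as cur:
    # A real killer would also filter by user/host and skip replication threads.
    cur.execute(
        "SELECT ID, TIME, INFO FROM PROCESSLIST WHERE COMMAND = 'Query' AND TIME > %s",
        (MAX_QUERY_SECONDS,),
    )
    for query_id, seconds, sql in cur.fetchall():
        print(f"killing query {query_id} after {seconds}s: {sql!r}")
        cur.execute(f"KILL {query_id}")
conn.close()
```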
[19:31:57] PROBLEM - MariaDB sustained replica lag on s8 on db1154 is CRITICAL: 15.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13318
[19:33:57] RECOVERY - MariaDB sustained replica lag on s8 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13318
[20:47:38] PROBLEM - MariaDB sustained replica lag on s1 on db1154 is CRITICAL: 144 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[20:53:38] RECOVERY - MariaDB sustained replica lag on s1 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[20:57:33] That was me running the clean up of private data scripts
[20:58:23] so we should not get any "private data found" emails but even if we do, nothing will be exposed publicly so just ping me and I clean it up
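On the table catalog mentioned earlier in the day: deriving the private-table list from a per-table "visibility" field could look roughly like the sketch below. The YAML layout shown is an assumption for illustration only, not the real catalog's schema.

```python
# Hypothetical sketch only: the real table catalog's layout may differ. It assumes
# a YAML shaped like
#   tables:
#     user: {visibility: private}
#     page: {visibility: public}
# and derives the list of private tables the way the conversation describes.
import yaml

def private_tables(catalog_path):
    with open(catalog_path) as f:
        catalog = yaml.safe_load(f)
    return sorted(name for name, meta in catalog.get('tables', {}).items()
                  if meta.get('visibility') == 'private')

# Example (path is a placeholder): print(private_tables('tables-catalog.yaml'))
```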