[04:47:16] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:52] I'm setting up that host ^
[07:49:20] FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2226:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:53] ^ I am setting that one up too
[08:02:16] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:17:56] <_joe_> I just went through https://phabricator.wikimedia.org/T382947 (jeez you people write waay too much :D)
[11:18:29] mar*stegui: for later - Correction: actually, 50% of dbprovs and backup sources are already upgraded, as the new dbprovs were set up with bookworm
[11:18:30] <_joe_> and I'm not sure I get what the problem is with having dbstore in dbctl, if that's the case
[11:18:53] <_joe_> marostegui / Amir1 asking you two :)
[11:19:54] _joe_: lunch break / will be around soon, so may respond later
[11:20:26] _joe_: Essentially we don't want to have non-production hosts there, especially if they are multi-instance, which is the case. Also, what Amir1 said there... having hosts there, even with weight 0, doesn't mean MW won't eventually send them traffic
[11:20:55] <_joe_> ok
[11:21:24] <_joe_> so basically you don't want that stuff around, and ben doesn't want to have to manually commit stuff and deploy when a server goes down
[11:21:43] _joe_: There is nothing to commit if they go down
[11:21:48] _joe_: There is no replacement or anything if they go down
[11:21:58] There is only one instance per section
[11:22:18] I am not up to date, but I wonder if it would make sense to plan for a non-mw-production dbctl setup (for misc, analytics, etc.) - probably in the future
[11:22:31] jynus: Maybe confctl, as we do with wikireplicas
[11:23:02] basically unifying how dynamic config is handled but having separate realms, to some extent
[11:24:08] although it could be hard unifying beyond the stack, as it is handled by different teams and they may want different interfaces
[11:24:32] sorry, I always think too high level and that may not be helpful for immediate solutions
[11:24:38] <_joe_> jynus: that's one of the possibilities I was pondering
[11:25:03] <_joe_> marostegui: ok yeah sounds reasonable
[11:25:15] for example, at some point having the haproxies from misc controlled by etcd
[11:25:54] and trying to have a more or less integrated vision of configuration, even if each has its differences
[11:26:25] the same thing would apply to backup sources
[11:26:25] _joe_: yeah, I suggested going with conftool once there is actually stuff to depool/repool
[11:26:49] <_joe_> that seems fair enough
[11:27:58] <_joe_> in terms of special-casing, the thing to use is the SERVERGROUP env variable
[11:28:29] <_joe_> and the ClusterConfig class
[11:28:46] <_joe_> we might want to add a trait called "dumps" to that class
[11:29:04] yeah, and we can remove it from dbctl once we are sure it is all good
[11:29:16] yup, I even mentioned your refactor as a useful way to use
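
For context on the special-casing mentioned above: the SERVERGROUP environment variable and the ClusterConfig class live in MediaWiki's wmf-config (PHP), so the snippet below is only a conceptual Python sketch of the idea, not the actual implementation. The host map, the function name, and the exact meaning of a "dumps" trait are assumptions for illustration.

import os

# Hypothetical illustration: branch on SERVERGROUP so that dumps-style jobs
# read a dedicated (non-dbctl-managed) host, while everything else keeps
# using the normal, dbctl-managed pool.
DUMPS_DB_HOSTS = {
    # hypothetical section -> dedicated dump host mapping
    "s1": "dump-db-1.example.org:3311",
    "s2": "dump-db-2.example.org:3312",
}

def pick_db_host(section: str, pooled_host: str) -> str:
    """Return the DB host to use for a section, special-casing dumps."""
    if os.environ.get("SERVERGROUP") == "dumps" and section in DUMPS_DB_HOSTS:
        return DUMPS_DB_HOSTS[section]
    return pooled_host

if __name__ == "__main__":
    print(pick_db_host("s1", "pooled-db-1.example.org:3306"))

In the real setup the equivalent decision would sit in the ClusterConfig class rather than in an ad hoc helper like this; the sketch only shows where the SERVERGROUP check fits.
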
[11:29:31] <_joe_> do you think we need a meeting?
[11:29:51] <_joe_> the phab ping-pong of infinite messages seems to be going nowhere
[11:30:14] _joe_: I think it has been sorted that we will go for MW (per the last answer from Ben)
[11:31:18] <_joe_> I read that as "for now"
[11:31:44] <_joe_> which, in light of the fact that there's only one instance per section, takes on a different meaning
[11:31:55] <_joe_> I'll comment on a couple of questions from ben
[11:32:10] <_joe_> which are more mw-specific and mw-running-in-prod specific
[11:32:24] _joe_: "For now" seems to be for a few years, based on his previous answers about the future of the current architecture
[11:32:36] <_joe_> ack yeah
[11:33:09] <_joe_> I am a bit wary of the proposal of using SRV records for databases.
[11:33:47] I believe that is what wikireplicas are using (or maybe they were in the past)
[11:34:10] <_joe_> I wouldn't consider anything wikireplicas do as an example
[11:34:32] <_joe_> but I'll think about that proposal a bit more; if we contain it to dumps it's a reasonable approach I think
[11:35:53] _joe_: I am not saying it is an example, what I am saying is that it may come from there :)
[11:38:06] wikireplicas mgmt is using conftool, I don't think they use srv tools
[11:38:38] Before conftool I believe they did
[11:51:22] <_joe_> I'll try to write a few patches this afternoon to test how feasible it is, assuming oncall doesn't take my attention away
[11:58:21] marostegui: I did some digging for SRV records in wikireplicas, but I didn't find any, either in the current or the current-1 setup
[11:58:25] what I did find is https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/MediaWiki_replicas#Database_setup
[11:58:59] so it looks like SRV records were/are used for dbstore hosts
[11:59:05] but I'm not familiar with those
[12:05:27] yeah, that's only for dbstore in stat machines
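On the SRV-record idea discussed above (per the wikitech page, SRV records were/are used to point stat-machine clients at dbstore hosts), a resolution would look roughly like the dnspython sketch below. The record name is made up for illustration, and the "lowest priority, then highest weight" selection is a simplification of the RFC 2782 weighted choice.

# Minimal sketch of resolving an SRV record to a (host, port) pair with
# dnspython. The record name is hypothetical.
import dns.resolver

def resolve_db_srv(record: str) -> tuple[str, int]:
    answers = dns.resolver.resolve(record, "SRV")
    # Simplified selection: lowest priority wins, then the largest weight.
    best = min(answers, key=lambda r: (r.priority, -r.weight))
    return str(best.target).rstrip("."), best.port

if __name__ == "__main__":
    host, port = resolve_db_srv("_mysql._tcp.dumps-db.example.org")  # hypothetical record
    print(f"{host}:{port}")
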
[14:00:03] marostegui: just to double check, is there anything specific about rebooting depooled pc hosts I should be aware of?
[14:00:20] Amir1: just upgrade them please
[14:00:34] yeah, that's on my list too
[14:00:38] Amir1: if they are masters, you'll need to silence the other master too
[14:00:43] as they are multi-master
[14:01:01] noted
[14:01:34] since we basically need to depool in pairs, I do the reboot and upgrade in pairs too
[14:01:42] sounds good yeah
[14:01:58] Update the kernel reboot task please
[14:03:25] I will
[14:03:37] actually one of them has a dangling replica
[14:03:54] That needs to be silenced too
[14:03:59] Which I guess is a spare for that section
[14:04:07] yeah
[14:04:21] I'd leave it to you to wipe it and set it up as pc6
[14:05:14] is the task there?
[14:05:31] I will create it once I'm done with pc5
[14:05:42] Just to be clear, you want pc6 with one host in codfw and one in eqiad, right?
[14:06:13] yup, basically get rid of the concept of a spare and make it similar to the other sections
[14:06:19] *other pc sections
[14:06:43] yeah, but we have two spares at the moment in eqiad
[14:06:47] do you want pc6 and pc7?
[14:06:50] yup
[14:06:57] great
[14:07:30] I will also try to organize the existing pc hosts, because they are a bit of a mess now with hostnames and such
[14:07:36] just to try to have everything aligned if possible
[14:07:54] if we can find some lonely idle db host somewhere, I'm also game for pc8, since now we can just depool stuff if it goes down (and we made a lot of graceful degradation changes)
[14:08:09] That is going to be more difficult
[14:08:44] We could get the spares from x2
[14:08:46] I had my eye on one of the replicas of x2
[14:08:47] Once x2 is fixed :)
[14:09:03] Yeah, right now we have two replicas per DC
[14:09:21] yeah, the good news is that x2 is also SqlBagOStuff, so most of the things we did here apply to x2 too
[14:09:35] Amir1: I guess we can probably steal one per DC
[14:09:39] And still have one spare left
[14:09:53] sobanski: Do you know when/if we'll need the spare replica for the phab upgrade testing?
[14:09:55] so we can set up x4 and split the load; in case of issues or maint, we just depool the whole section
[14:10:13] Amir1: So let's go for pc5 and pc6 for now and then we can see what to do with x2
[14:10:17] I don't
[14:10:31] Don't let it be a blocker for anything else
[14:10:37] sobanski: Got it :)
[14:10:40] Thank you!
[14:11:01] Amir1: Let me reorganize pc a bit, set up pc5 and pc6, and then we can talk about x2
[14:11:04] and pc8
[14:11:08] Sounds good?
[14:11:48] yeah, I'm totally fine with waiting a bit. My point is that now it's more of a ring: we don't need to worry about maint and it is much more horizontally scalable; if we end up in an emergency we can probably even scale it to pc10 too
[14:12:22] Yeah, I like that we can depool a whole section and do whatever we need there
[14:12:27] It is a very good improvement
[14:12:39] Btw, we should document the fact that depooling a section actually means setting both hosts to 0
[14:12:47] Instead of dbctl instance pc10xx depool
[14:12:52] I will do that too. It's on my todo list
[14:13:11] but first, reboot and upgrade
[14:14:10] Yep
[14:14:24] And you owe me the dbctl way of preventing depooling all sections!
[14:14:42] I will :D
[14:15:54] I'm also thinking of actually automating the reboot via my script. It would just go over all pc sections and reboot them automatically
[14:16:19] Yeah
[14:16:21] And upgrade
[14:24:20] I can actually use the upgrade cookbook
[14:24:21] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/mysql/upgrade.py
[14:24:54] ah yes, because it does a reboot
[14:46:08] Used the upgrade cookbook for the other one. So much simpler. Now everything is fine. Pooling back pc5
[14:51:38] Latency back to the previous state (reusing values)
[14:51:44] https://usercontent.irccloud-cdn.com/file/skLMweAK/grafik.png
[17:44:21] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:59:44] we are at least in the process of draining that node (currently blocked on the dead disk elsewhere - T382874); this is still T372207, I'll extend the silence again
[17:59:45] T382874: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874
[17:59:45] T372207: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207
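
Going back to the parsercache automation idea above (loop over the pc sections, depool each one by setting both of its hosts to weight 0, upgrade/reboot the hosts via the sre.mysql.upgrade cookbook, then repool), the overall shape might look like the Python sketch below. Every helper and the section-to-host mapping are hypothetical stand-ins; the real steps would go through dbctl and the cookbook.

# Hypothetical outline only: the helpers just log what the real steps
# (dbctl weight changes + the sre.mysql.upgrade cookbook) would do.
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pc-upgrade")

# Hypothetical mapping of parsercache sections to their eqiad/codfw hosts.
PC_SECTIONS = {
    "pc1": ["pc1011", "pc2011"],
    "pc2": ["pc1012", "pc2012"],
    "pc3": ["pc1013", "pc2013"],
}

def depool_section(section: str) -> None:
    # Stand-in: in reality, set both hosts' dbctl weight to 0 and commit
    # (per the "setting both hosts to 0" note above), not a per-host depool.
    log.info("depooling %s: both hosts to weight 0", section)

def upgrade_host(host: str) -> None:
    # Stand-in: in reality, run the sre.mysql.upgrade cookbook, which
    # handles the package upgrade and the reboot.
    log.info("upgrading and rebooting %s", host)

def repool_section(section: str) -> None:
    # Stand-in: restore the previous weights and commit.
    log.info("repooling %s", section)

def upgrade_all_parsercache() -> None:
    for section, hosts in PC_SECTIONS.items():
        depool_section(section)
        for host in hosts:
            upgrade_host(host)
        repool_section(section)

if __name__ == "__main__":
    upgrade_all_parsercache()
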