[05:14:57] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:10] ^will be decommissioned today
[05:20:22] federico3: btw do you want me to re-run the decomm cookbook?
[07:08:27] @marostegui I can run it if you prefer, otherwise yes
[07:09:24] federico3: Sure, up to you, my question was more if you had time to fix the issue I saw yesterday
[07:11:30] federico3: The task is https://phabricator.wikimedia.org/T427270 and the puppet patch is https://gerrit.wikimedia.org/r/1294123, let me know if you want to do it or you want me to do it, either is fine
[07:13:04] @marostegui yes I did update the cookbook
[07:13:16] federico3: Ok, I will run it now then!
[07:14:05] ok
[07:24:07] federico3: all fine, I will make a comment on the patch with a "UX" improvement, but it all went well
[07:27:22] thanks!
[07:28:17] thank you, very nice work
[08:39:45] there has been a low-level alert "NRPE CHECK: Check whether ferm is active by checking the default input chain" from pc2022 for a few days, is anybody working on it?
[08:40:09] Amir1: ^
[08:40:19] moritzm: ^ related to the migration to nftables?
[08:40:26] I think there was some work on that host for that
[08:43:14] that's the pilot DB host on nftables, this error is harmless and will be fixed via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283620
[10:32:22] cezmunsta: the slides will have more data: https://www.slideshare.net/slideshow/backing-up-wikipedia-databases/178453520
[10:32:28] or context
[10:36:23] marostegui: not urgent, but do you think there will soon be a host to provision, or some usage of test-s4 we can do, so cezmunsta can test a full recovery?
[10:36:55] jynus: Yes, cezmunsta will start with this https://phabricator.wikimedia.org/T407942 soonish
[10:37:15] So even any of the ones at the bottom that will not be replaced, and simply decommissioned, can be used if needed too
[10:37:28] nice
[10:38:19] or just recover for reals on a new ones
[10:39:06] marostegui: are they already physically onsite?
[10:39:33] jynus: yes, they are installed
[10:39:59] so it could be db1265 too, as I set backup sources from backups anyway
[10:40:43] anything you like from that list yeah
[10:41:02] just communicating to you I have encouraged him to do a recovery from offline backups so you give him the space for practical testing
[10:41:19] Sure! thanks!
[10:46:29] jynus: noted the extra slides + restore test :)
[10:47:24] I've sent an invite for bacula recovery, and invited SRE calendar
[10:47:36] let me know if the time works for you
[10:47:53] Yep, I have just accepted the invite, thanks
[10:58:33] anyone working with db2189?
[10:58:50] nope
[11:01:20] mmm
[11:01:25] db2189 went down yesterday
[11:01:34] federico3: 11:25 fceratto@cumin1003: dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P92978 and previous config saved to /var/cache/conftool/dbconfig/20260526-112513-fceratto.json
[11:01:39] was that part of the rack maintenance?
[11:02:14] B8
[11:02:18] shouldn't
[11:02:25] I think it is part of the reboots
[11:02:43] ah, so stuck on reboot or something like that?
[11:02:54] federico3: can you check if/why db2189 was repooled if it is still down?
[11:03:16] And also why the script didn't stop? (Don't know if this was already implemented or not)
[11:03:37] yeah, it is not new, 23h down, so it must be an expiration
[11:03:38] looking
[11:03:53] yeah
[11:03:57] but it was repooled
[11:04:54] :-(
[11:06:14] for now https://gerrit.wikimedia.org/r/c/operations/puppet/+/1294249/
[11:06:20] it was depooled by auto schema, rebooted, did the usual start of mysql, replica and the heartbeat check and then repooled
[11:06:33] so it crashed after it?
[11:06:40] https://phabricator.wikimedia.org/P93228
[11:06:54] let's check hw
[11:07:01] checking, one sec
[11:07:31] federico3: the script won't remove the downtime?
[11:08:22] auto schema? maybe not, we should tweak it
[11:09:04] The host is full of Description: An OEM diagnostic event occurred.
[11:09:14] are you seeing prometheus metrics for it?
[11:09:15] But nothing else, probably worth asking DCOps to give it a look - I will create a task
[11:09:57] it crashed by itself while in the silence window of the reboot?
[11:10:16] that's what we are discussing here
[11:10:27] Date/Time: 05/26/2026 11:51:03
[11:10:30] how long is the downtime? if it is 24 hours or 48 hours, probably yes
[11:10:38] that's the HW entry
[11:10:49] as it crashed 23 hours ago
[11:10:53] The end of the repool is at 2026-05-26 11:35:14.438243 dbctl instance db2189 pool -p 100
[11:10:59] yep, here's the last datapoint in prometheus https://grafana.wikimedia.org/goto/ffnbktk0lsohsa?orgId=1
[11:11:20] one more reason to add the removal of the silence as we discussed in the past days
[11:12:06] yep
[11:12:09] https://phabricator.wikimedia.org/T427376
[11:14:26] I'm opening a task for the silence removal
[11:14:49] make it a child or a parent of the one cezmunsta created about the automatic downtime
[11:26:06] unrelated, I was checking 2021 media statistics, and I've seen total media original bytes has tripled in 5 years (!) while databases have only grown 50%
[11:27:12] code repositories have grown 25x
[12:40:46] marostegui: FYI, I deleted binlogs older than a year again, you'll see blips like this in a lot of hosts https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1212&var-datasource=000000026&var-cluster=mysql&viewPanel=panel-28&from=now-3h&to=now&timezone=utc
[12:40:59] Amir1: thanks
[12:52:14] federico3: https://phabricator.wikimedia.org/T427388 that probably needs also the ops-codfw tag no?
[12:52:24] (nice to see the automatic task creation working!)
[12:53:59] @marostegui the initial plan was to just open the task for the DBA tag
[12:54:42] (and indeed 2026-05-27 12:30:53 [s1 db2212] PANIC, host didn't come back online ... it failed to SSH)
[12:55:15] (and still failing now)
[12:55:23] federico3: yep, and that's great, what I am saying is that we should add it now, otherwise dcops won't see it
[12:55:44] ah I thought you meant the script should tag it
[13:02:38] I don't think so for now
[13:02:52] As we may want to make sure first that they are not false positives etc
[13:02:59] yep
[13:03:03] federico3: how long is the downtime?
[13:03:09] because we probably want to extend it too
[13:03:22] to avoid disturbing oncall
[13:03:35] (plus I would not auto-open a task to another team without asking them if they are happy with it first)
[13:03:55] yep
[13:04:12] the default in autoschema, better bump it up
[13:04:50] yeah, give it till monday to be sure
[13:06:10] updated
[13:06:13] thanks
[15:04:44] I have noticed this getting close a few times, where Icinga is outdated when compared to "reality" ... this time it broke the run https://phabricator.wikimedia.org/P93266
[15:07:33] cezmunsta: wait_for_optimal fails while the host is actually ok?
[15:08:00] Based upon Grafana, yes... just checking the host itself
[15:08:27] raised: Not all services are recovered: db1178:MariaDB sustained replica lag on s8
[15:08:40] maybe it's not using the best metric?
[15:08:40] SQL_Remaining_Delay: NULL
[15:10:19] cezmunsta: some codebases might still be using the lag from show replica status instead of pt-heartbeat, see https://phabricator.wikimedia.org/T367278
[15:10:48] albeit usually both metrics should be ok once the lag goes down
[15:11:30] I have extended downtime by 24h ... given that it was due to expire in a short while
[15:13:19] the host seems to have recovered as expected in https://grafana.wikimedia.org/goto/ffnc6g1awomioe?orgId=1
[15:14:16] federico3: yes, it seemed fine despite Icinga being too slow to recognise that
[15:15:37] So, I presume that since it failed at that point, only the repooling and removal of downtime should be left to do?
[15:16:41] It looks like Icinga shows 2026-05-27 15:10:44 as the recovery time?
[15:17:35] maybe the cookbook is too aggressive with timeouts? In both major-upgrade and upgrade we first do get_db_instance(spicerack.mysql(), fqdn).wait_for_replication() and then icinga_hosts(host.hosts).wait_for_optimal()
[15:18:11] yes, I would first remove the downtime then repool
[15:18:30] ack will do
[15:19:40] there is no timeout https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts.wait_for_optimal 🤷 if it happens often we could do a retry
[15:21:52] It reached 12/15 on the other instance that was in the run, plus I have seen it reach 15/15 on at least one occasion.
[15:23:04] I will note this on the related task shortly
[16:07:32] I am stupid, I spent more time than I can admit debugging an issue, because a lot of files didn't match on metadata vs filesystem
[16:07:42] can you guess what was the issue?
[16:08:07] the hash and the name of the files didn't match
[16:08:17] it was encryption
[16:08:38] 🤦
[16:38:50] jynus: s/It's always DNS/\0 ... unless it's encryption/ :P
[16:38:59] ha ha
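
(note on the 15:19:40 "we could do a retry" idea) A minimal sketch of what an outer retry around a wait step might look like, assuming the wait call signals "not recovered yet" by raising. Everything here is hypothetical and for illustration only: `retry_wait`, the `check` callable, and the use of `RuntimeError` are stand-ins, not the spicerack API, and the real `wait_for_optimal` may raise a different exception type.

```python
import time
from typing import Callable


def retry_wait(
    check: Callable[[], None],
    attempts: int = 3,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it stops raising RuntimeError.

    Returns the 1-based attempt number that succeeded. If every attempt
    fails, the exception from the last attempt is re-raised so the caller
    can still abort the run.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            check()  # hypothetical stand-in for a wait step that raises on failure
            return attempt
        except RuntimeError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)  # fixed pause before checking again
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the pause can be skipped when trying the helper out; a cookbook would pass its own wait step as `check` and decide separately whether a final failure should block the repool.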