[08:28:12] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:36] alerts.wm.org has a bunch more things about pc1013, maybe it needs downtiming if you're working on it?
[08:43:43] arnaudb: ^--
[08:44:43] Emperor: I just sent an email for this, I was perplexed as this host has notifications disabled, will look into it rn
[08:44:58] sorry for the noise!
[08:50:32] NP
[08:59:01] jynus: o/ backup1012 is ready, lemme know if anything looks weird or not when you use it
[09:11:56] thanks
[09:49:58] elukey: do you happen to know if backup2012 will have a similar issue, or is there no news of that? Or was it just unblocked because of the work on 1012?
[09:53:15] jynus: I wasn't aware of backup2012; if it is the same model as 1012 then it is highly probable that we'll see this issue again (but it depends on what firmware is deployed). In any case, if the issue re-appears, unblocking the task will be a matter of half an hour, we have a good procedure now
[09:53:31] thanks
[09:53:39] yes, it was bought at the same time
[10:00:47] sneaky issue, for some reason those nodes were shipped with old firmware
[10:01:00] either they were sitting somewhere at supermicro since 2022 :D
[10:01:20] or something is wrong in their build pipeline
[10:05:10] can I flag something to you, as I now consider you the Supermicro expert?
[10:08:02] Icinga set up a "Dell PowerEdge RAID Controller" alert, which of course doesn't work
[10:10:35] ahahah I am not sure if it is a badge of honor or a curse :D
[10:11:41] Should I create a ticket for dcops/IF about that?
[10:11:47] jokes aside, good point, lemme check
[10:12:35] let me delete the facts cache just in case
[10:13:54] ack
[10:14:14] I haven't seen it for other supermicro nodes yet, but backup1012 may be the first of its kind
[10:14:45] so RAID status monitoring is a big deal, this being a backup host
[10:16:11] this is something that I am still ignorant about, never provisioned supermicros with RADI
[10:16:14] *RAID
[10:18:16] * Emperor elides their standard rant about the evils of hardware RAID ;-)
[10:50:36] sigh, swift never VACUUMs its container databases, so a few of them have become really big, and thus we have disk issues
[10:51:06] (before VACUUM - 7.1G, after VACUUM 3.2G)
[10:54:06] weird, because the controller, a Broadcom 1000:10e2, is in our list of supported controllers but doesn't seem to be recognized
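For reference, a minimal sketch of the VACUUM step discussed above, assuming the swift container databases are plain SQLite files; the path at the bottom is a made-up example, and you would want the container service quiesced (or work on a copy) since VACUUM rewrites the whole database file:

    import os
    import sqlite3

    def vacuum_container_db(path):
        # VACUUM rewrites the database file, dropping free pages left behind
        # by deleted rows; this is the kind of shrink noted above (7.1G -> 3.2G).
        before = os.path.getsize(path)
        # isolation_level=None keeps the connection in autocommit mode,
        # since VACUUM cannot run inside an open transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        try:
            conn.execute("VACUUM")
        finally:
            conn.close()
        after = os.path.getsize(path)
        print(f"{path}: {before / 1e9:.1f}G -> {after / 1e9:.1f}G")

    # hypothetical path, for illustration only
    vacuum_container_db("/srv/swift-storage/containers/example.db")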
[13:43:12] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2149:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:20] arnaudb: I think that's yours? ^
[13:44:44] yep, checking
[13:46:57] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2149:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:50] I see you are downgrading hosts, is it already confirmed that the issue is a mariadb bug? Do you have more info?
[13:55:41] this is a work in progress, but we are quite sure that our issue appeared after upgrading to .19
[13:55:57] jynus: I haven't fully checked all values but the one or two values I checked are now correct
[14:02:24] thanks
[14:02:43] I wanted to be sure to avoid that
[14:02:46] version
[14:10:51] we reverted to .17 fwiw
[14:45:16] Matthew will love this T377853
[14:45:17] T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853
[17:05:32] With no need for realtime or urgency, looking for a DBA to deploy some GRANT changes. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080781
[17:12:46] I'll take a look
[17:13:50] mutante: it looks good but these files are mostly for decoration. I need to deploy the changes separately, I can do it later today, will you be around?
[17:18:49] Amir1: Yea, I'm aware it needs separate deployment. I am around, I guess it depends when. There is nothing to check though except that creating db dumps stops failing.
[17:19:43] also, thank you, I would normally not ask this kind of thing on realtime chat, I only did because I wasn't sure who is currently the right reviewer and had not gotten a response on gerrit yet.
[17:21:38] no worries! That's why I'm here
[21:14:02] PROBLEM - MariaDB sustained replica lag on s5 on db2213 is CRITICAL: 37.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2213&var-port=9104
[21:15:02] RECOVERY - MariaDB sustained replica lag on s5 on db2213 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2213&var-port=9104
[21:46:32] that's an s5 dumper
[22:10:30] mutante: I deployed the changes, can you check if everything is working fine? :D
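For reference, a minimal sketch of checking replica lag by hand on a MariaDB replica such as the one in the 21:14 alert; this reads SHOW SLAVE STATUS rather than whatever the production check actually measures (the log above also mentions pt-heartbeat), and the host and credentials below are placeholders, not real production access:

    import pymysql

    def replica_lag_seconds(host, user, password):
        # Read Seconds_Behind_Master from SHOW SLAVE STATUS on a MariaDB replica;
        # it is NULL (None) when replication is not running.
        conn = pymysql.connect(host=host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        return None if row is None else row["Seconds_Behind_Master"]

    # placeholder host and credentials, for illustration only
    lag = replica_lag_seconds("db2213.example.wmnet", "monitor", "secret")
    print("replica lag:", lag, "seconds")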