[06:44:06] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:51:40] ^ that can be ignored
[08:13:06] Is there a process defined for Alertmanager to silence alert rules ahead of time?
[08:13:38] there's a cookbook for downtiming hosts.
[08:13:54] (sre.hosts.downtime)
[08:18:11] Emperor: thanks, I wasn't sure if that did Alertmanager too. Having a read of the code now
[08:23:26] cezmunsta: It should do, yes, although it is on experience not 100%
[08:36:47] incident overnight has wedged a bunch of stats etc processes in codfw, I'm clearing them out now
[08:40:00] is db2218 being down a maintenance event or so?
[08:41:04] not that I know
[08:42:58] it seems to be down and is paging (see operations), is there anything i could do to help?
[08:43:00] bjsen looking
[08:43:04] want to move to private?
[08:43:09] sounds good
[08:48:48] FIRING: [14x] MysqlReplicationLagPtHeartbeat: MySQL instance db2159:9104 has too large replication lag (11m 16s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[08:52:24] PROBLEM - MariaDB sustained replica lag on s7 on db2218 is CRITICAL: 788 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[08:54:06] FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2218:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:24] RECOVERY - MariaDB sustained replica lag on s7 on db2218 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[09:01:48] Maybe I am not looking in the right place, maybe there aren't any to find... does anything check for host/DB uptime?
[09:03:10] cezmunsta: other than the console itself: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=db2218&var-datasource=000000026&var-cluster=mysql&refresh=5m and https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db2218&var-port=9104&refresh=1m
[09:04:00] marostegui: sorry, I meant s/check/check and alert/
[09:04:44] not an alert per se on uptime, but we do alert on ping
[09:07:04] ack ... now found "[2026-05-15 08:38:37] HOST ALERT: db2218;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%" in the alert history
[09:39:01] Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1287826 please, changing two more ms-be nodes to new-style storage?
[09:39:24] (not sure how many of us are actually here today)
[09:47:04] Emperor: reviewed, LGTM fwiw
[09:56:53] thanks :)
[09:58:00] cezmunsta: FWIW, local practice is generally the reviewer does +1 not +2, and then the submitter does the +2 [this matters less for puppet, but for some of our repos, setting +2 sets a bunch of automation off]
[10:00:57] tbh I was looking for a +1 and tried the code review button *thinking* that it might have a drop down ... lesson learned :) ... next lesson starting, how do I change in the UI?
[10:03:13] hit "Reply", you get a set of "verified" / "code review" boxes - hit +1 on code review, add some text, hit Send
[10:07:27] Yep, just found it whilst checking what was in Reply... noted for next time
[11:09:06] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:16] FIRING: [11x] DiskSpace: Disk space backup1004:9100:/srv/objectstorage 0.6583% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:53:57] Emperor: is swift @ codfw under maintenance?
[11:54:24] Re: alerts, I silenced them for a month
[11:56:54] jynus: I was reimaging a couple of backends, but they're out of the rings. why?
[11:57:16] I do see some slightly-unhappy backends, I'll give them a kick
[11:57:25] according to several metrics and alerts, no new images are being added to codfw
[11:57:31] upload ratio shows 0
[11:57:52] plus "MediaWiki swift object counts site diffs"
[11:57:56] jynus: metrics collection definitely sad overnight, but I thought I'd restarted all the necessary this morning. Maybe I'll just reboot the stats reporter host
[11:58:05] ok, gotcha
[11:58:17] if it is just a metric issue I was less worried
[11:59:18] I created a logging only rule for our heavy originals scraper, feel free to advocate for a full ban if it is causing you headaches
[12:04:50] Sadly, my codfw backup process is a bit backlogged by 24 hours, so I couldn't realize first person
[12:08:18] im going to increase concurrency a bit
[12:30:44] jynus: now the growth will look stupid, because the past N hours have all landed at once
[12:31:46] all good, I was worried it had stopped completely
[12:31:54] not virtually :-D
[12:32:12] I tend to look a lot at the growth graphs for my own stuff
[13:17:00] Emperor: o/ sretest2010 is still in a weird state, I was able to install Trixie but reimage works one time out of 5/6 for some reason that I still don't get. I reported everything to SM.