[06:44:06] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:51:40] <marostegui>	 ^ that can be ignored
[08:13:06] <cezmunsta>	 Is there a process defined for Alertmanager to silence alert rules ahead of time?
[08:13:38] <Emperor>	 there's a cookbook for downtiming hosts.
[08:13:54] <Emperor>	 (sre.hosts.downtime)
[08:18:11] <cezmunsta>	 Emperor: thanks, I wasn't sure if that did Alertmanager too. Having a read of the code now
[08:23:26] <Emperor>	 cezmunsta: It should do, yes, although in my experience it's not 100% reliable
[08:36:47] <Emperor>	 incident overnight has wedged a bunch of stats etc processes in codfw, I'm clearing them out now
[08:40:00] <bjensen>	 is db2218 being down a maintenance event or so?
[08:41:04] <federico3>	 not that I know
[08:42:58] <bjensen>	 it seems to be down and is paging (see operations), is there anything i could do to help?
[08:43:00] <federico3>	 bjensen: looking
[08:43:04] <federico3>	 want to move to private?
[08:43:09] <bjensen>	 sounds good
[08:48:48] <jinxer-wm>	 FIRING: [14x] MysqlReplicationLagPtHeartbeat: MySQL instance db2159:9104 has too large replication lag (11m 16s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[08:52:24] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s7 on db2218 is CRITICAL: 788 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[08:54:06] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2218:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:24] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s7 on db2218 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[09:01:48] <cezmunsta>	 Maybe I am not looking in the right place, maybe there aren't any to find... does anything check for host/DB uptime?
[09:03:10] <marostegui>	 cezmunsta: other than the console itself: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=db2218&var-datasource=000000026&var-cluster=mysql&refresh=5m and https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db2218&var-port=9104&refresh=1m
[09:04:00] <cezmunsta>	 marostegui: sorry, I meant s/check/check and alert/
[09:04:44] <marostegui>	 not an alert per se on uptime, but we do alert on ping
[09:07:04] <cezmunsta>	 ack ... now found "[2026-05-15 08:38:37] HOST ALERT: db2218;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%" in the alert history 
[09:39:01] <Emperor>	 Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1287826 please, changing two more ms-be nodes to new-style storage?
[09:39:24] <Emperor>	 (not sure how many of us are actually here today)
[09:47:04] <cezmunsta>	 Emperor: reviewed, LGTM fwiw
[09:56:53] <Emperor>	 thanks :)
[09:58:00] <Emperor>	 cezmunsta: FWIW, local practice is generally the reviewer does +1 not +2, and then the submitter does the +2 [this matters less for puppet, but for some of our repos, setting +2 sets a bunch of automation off]
[10:00:57] <cezmunsta>	 tbh I was looking for a +1 and tried the code review button *thinking* that it might have a drop down ... lesson learned :) ... next lesson starting, how do I change it in the UI?
[10:03:13] <Emperor>	 hit "Reply", you get a set of "verified" / "code review" boxes - hit +1 on code review, add some text, hit Send
[10:07:27] <cezmunsta>	 Yep, just found it whilst checking what was in Reply... noted for next time 
[11:09:06] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on pc2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:16] <jinxer-wm>	 FIRING: [11x] DiskSpace: Disk space backup1004:9100:/srv/objectstorage 0.6583% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space  - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:53:57] <jynus>	 Emperor: is swift @ codfw under maintenance?
[11:54:24] <jynus>	 Re: alerts, I silenced them for a month
[11:56:54] <Emperor>	 jynus: I was reimaging a couple of backends, but they're out of the rings. why?
[11:57:16] <Emperor>	 I do see some slightly-unhappy backends, I'll give them a kick
[11:57:25] <jynus>	 according to several metrics and alerts, no new images are being added to codfw
[11:57:31] <jynus>	 upload ratio shows 0
[11:57:52] <jynus>	 plus "MediaWiki swift object counts site diffs"
[11:57:56] <Emperor>	 jynus: metrics collection definitely sad overnight, but I thought I'd restarted all the necessary this morning. Maybe I'll just reboot the stats reporter host
[11:58:05] <jynus>	 ok, gotcha
[11:58:17] <jynus>	 if it is just a metric issue I was less worried
[11:59:18] <jynus>	 I created a logging only rule for our heavy originals scraper, feel free to advocate for a full ban if it is causing you headaches
[12:04:50] <jynus>	 Sadly, my codfw backup process is backlogged by 24 hours, so I couldn't notice it first-hand
[12:08:18] <jynus>	 I'm going to increase concurrency a bit
[12:30:44] <Emperor>	 jynus: now the growth will look stupid, because the past N hours have all landed at once
[12:31:46] <jynus>	 all good, I was worried it had stopped completely
[12:31:54] <jynus>	 not virtually :-D
[12:32:12] <jynus>	 I tend to look a lot at the growth graphs for my own stuff
[13:17:00] <elukey>	 Emperor: o/ sretest2010 is still in a weird state, I was able to install Trixie but reimage works one time out of 5/6 for some reason that I still don't get. I reported everything to SM.