[00:13:59] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:34] RESOLVED: DiskSpace: Disk space thanos-be2006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:58:59] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:07] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1270860 please? I'm going to try and migrate ms-be206[8,9] to our newer storage layout while they're out of the rings for the VLAN move
[08:13:51] TY :)
[09:36:20] @marostegui ah, the puppet agent failed and the script crashed
[09:36:46] federico3: probably because puppet is disabled everywhere
[09:38:02] I should update the script to leave the downtime in place on failure tho
[09:43:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2144:9104 has too large replication lag (11m 29s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2144&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[10:03:48] FIRING: PuppetFailure: Puppet has failed on ms-be2062:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:06:19] the cumin1003 reboot/update was planned for , but there's a reimage of db1168 running, a schema change for the s1-codfw-replicas, and I can also see some netcat data transfer ongoing
[10:06:39] why haven't these been started from cumin2002? can you let us know when these are done so that we can proceed?
[10:06:40] moritzm: db1168 reimage will be done in a few min
[10:06:43] ok
[10:06:56] moritzm: was the time/day communicated? I missed it if it was
[10:07:15] I asked last week here and I sent a mail to sre-at-large@, so kinda yes :-)
[10:07:40] moritzm: I recall seeing it but I totally forgot - my apologies
[10:07:42] I guess federico3 too?
[10:07:53] sure, it's not a big deal, just let us know when it's good to proceed
[10:08:08] moritzm: db1168 will be finished in around 20 mins or so; the other tasks are probably from federico3 and/or jynus?
[10:08:44] db1168 should have finished a lot earlier but it was blocked on the puppet issues
[10:08:45] not me
[10:08:57] backups finished at 6:42
[10:09:17] jynus: there's an nc (pid 3425065), but on closer look it's actually from 2025 so not very relevant?
[10:09:23] s1 codfw replicas are federico3
[10:09:26] mmm
[10:09:34] that must be a stuck thing
[10:09:37] let me check
[10:09:38] and I am not sure if Amir1's optimize is running from 1003 or 2002
[10:09:50] moritzm: I got a bit confused by having to move away from bast1003 in the same days; double-checking cumin1003
[10:10:01] ok!
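The 09:38 fix (leaving the downtime in place on failure) amounts to lifting the downtime only on the success path. A minimal sketch, with `set_downtime`/`remove_downtime` as hypothetical stand-ins for whatever alerting API the script actually uses:

```python
import logging

logger = logging.getLogger(__name__)


def run_with_downtime(host, work, set_downtime, remove_downtime):
    """Run `work` on `host`, lifting the downtime only on success.

    `set_downtime` and `remove_downtime` are hypothetical callables; the
    real script's alerting API is not shown in the log.
    """
    set_downtime(host)
    try:
        work(host)
    except Exception:
        # On failure, keep the downtime so the broken host does not page
        # while someone investigates; log and re-raise instead.
        logger.exception("work on %s failed; leaving downtime in place", host)
        raise
    remove_downtime(host)
```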
[10:10:51] I killed the process
[10:11:26] moritzm: all done from my side
[10:11:38] thanks
[10:11:51] moritzm: same here, all done
[10:12:38] thanks all, I'll give another heads-up on -sre and proceed
[10:12:55] will let you know when it's back and updated to Cumin 6
[10:12:58] +1 thanks, and sorry about the delay
[10:13:48] RESOLVED: PuppetFailure: Puppet has failed on ms-be2062:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:14:12] the ms-be2062 alert above was, I think, just a consequence of the puppet work e.lukey was doing
[10:24:59] cumin1003 can be used again, enjoy :-)
[10:26:19] thanks moritzm
[10:27:54] thanks
[10:29:42] there are some wmfmariadbpy packages flagged for downgrade; these are probably test packages?
[10:30:09] I left them as-is as part of the update, but if someone could double-check that this is expected, that would be good
[10:32:27] looking
[10:33:14] moritzm: downgrade? I'm seeing upgradable ones
[10:33:23] https://www.irccloud.com/pastebin/6C46NCPa/
[10:34:22] sorry, wrong name: I meant python-wmfdb and wmfdb-admin
[10:34:33] these are flagged for downgrades on cumin1003 currently
[10:34:51] I've lost track a bit of which of these get built from which source packages
[10:35:40] from https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/main/debian/control?ref_type=heads
[10:48:18] apologies for the email spam from ms-be2068 (which is in the midst of having its storage redone); I think I've tweaked it to not make noise now
[11:37:32] regarding cumin1003, I stopped it last night before sleep, so no issues on my side
[11:38:37] OK, I've worked out why the reimage is bust - could someone +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1270899 please? I needed to add the hosts in two places and only caught one in the previous CR :(
[11:43:26] I created a copy of the first panel, "Query Throughtput", in https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&timezone=utc&var-site=eqiad&var-group=core&var-shard=ms1&var-shard=ms2&var-shard=ms3&var-role=%24__all to get better resolution (see the second panel). If people like it I can remove the first panel.
[11:44:37] looks tidy to me, but I'm not really the target :)
[11:49:23] I'm copying a few other panels to increase resolution in the same way; the pooling-in of sections triggers some fluctuations that suggest the load-balancing algorithm is not too smooth
[11:51:13] can I derail you to have a quick look at my partman_early_command CR, please? :)
[11:54:39] looking
[11:56:48] FIRING: PuppetFailure: Puppet has failed on db1196:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:58:19] TY!
[12:01:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1167:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:01:54] there's an auto schema change running on that (but it should not affect puppet...?)
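One way to double-check the 10:29-10:34 downgrade question is to compare installed versus candidate versions directly with apt. A sketch using `apt-cache policy` and `dpkg --compare-versions` (both standard Debian tools; the package names are the ones from the log, and the packages are assumed to be installed):

```python
import subprocess


def apt_versions(pkg: str) -> tuple[str, str]:
    """Return (installed, candidate) as reported by `apt-cache policy`."""
    out = subprocess.run(["apt-cache", "policy", pkg],
                         capture_output=True, text=True, check=True).stdout
    fields = dict(line.strip().split(": ", 1)
                  for line in out.splitlines() if ": " in line)
    return fields["Installed"], fields["Candidate"]


def is_downgrade(pkg: str) -> bool:
    """True if apt's candidate version is older than the installed one."""
    installed, candidate = apt_versions(pkg)
    # dpkg exits 0 when the comparison holds, non-zero otherwise.
    return subprocess.run(
        ["dpkg", "--compare-versions", candidate, "lt", installed]
    ).returncode == 0


for pkg in ("python-wmfdb", "wmfdb-admin"):
    print(pkg, "would be DOWNGRADED" if is_downgrade(pkg) else "ok")
```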
[12:02:46] actually no, it's running on that section but not on that host
[12:31:48] RESOLVED: PuppetFailure: Puppet has failed on db1196:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:24:23] FIRING: PuppetFailure: Puppet has failed on dborch1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:24:41] federico3: ^
[13:25:03] oh, dborch
[13:32:06] it failed on ACME as the service is not running; I guess we can shut down the host, since the current host has been stable for many weeks
[13:34:25] federico3: sure, we can power it off but not decommission it till 1003 is behind the cdn
[17:24:38] FIRING: PuppetFailure: Puppet has failed on dborch1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:45:38] federico3: ^ so what was decided about this issue?
[17:47:03] @marostegui I'm ok with shutting it down, but I wanted to ask whether I can just sudo halt the host or whether there's a process to flag it as powered off in puppet & co
[17:48:25] federico3: you'll have to disable notifications and all that, but I think you can just power it off for now and then decom it once we are happy with it. If unsure, ask Luca to make sure nothing else is needed
[17:49:24] ok
[17:56:52] PROBLEM - MariaDB sustained replica lag on s7 on db2150 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[17:58:44] PROBLEM - MariaDB sustained replica lag on s7 on db2168 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[17:59:44] RECOVERY - MariaDB sustained replica lag on s7 on db2168 is OK: (C)10 ge (W)5 ge 3.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[17:59:54] RECOVERY - MariaDB sustained replica lag on s7 on db2150 is OK: (C)10 ge (W)5 ge 3 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[19:11:46] PROBLEM - MariaDB sustained replica lag on s7 on db2168 is CRITICAL: 13.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[19:12:30] PROBLEM - MariaDB sustained replica lag on s7 on db1170 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=9104
[19:12:34] what's up with this host?
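On the 17:47-17:48 power-off question: the order that matters is downtime first, halt second, decommission much later. A sketch of that sequence run from a cumin host; sre.hosts.downtime is the usual cookbook for silencing a host, though the exact flags below are from memory and worth checking against its --help:

```python
import subprocess

HOST = "dborch1001.wikimedia.org"


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Silence alerting first so the power-off does not page anyone
#    (flags assumed from memory; verify against the cookbook's --help).
run(["sudo", "cookbook", "sre.hosts.downtime", "--days", "30",
     "--reason", "powered off pending decommission", HOST])

# 2. Then halt the host over SSH (the "sudo halt" from the log).
run(["ssh", HOST, "sudo", "poweroff"])

# 3. Decommissioning proper waits until 1003 is behind the CDN,
#    per the 13:34 message above.
```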
[19:14:30] RECOVERY - MariaDB sustained replica lag on s7 on db1170 is OK: (C)10 ge (W)5 ge 1.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=9104
[19:14:46] RECOVERY - MariaDB sustained replica lag on s7 on db2168 is OK: (C)10 ge (W)5 ge 2.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[20:48:34] FIRING: DiskSpace: Disk space thanos-be2006:9100:/ 3.709% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:24:38] FIRING: PuppetFailure: Puppet has failed on dborch1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:38:59] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
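The "(C)10 ge (W)5 ge 3.2" notation in the sustained-replica-lag checks above reads: the measured value (3.2s there) is compared with >= against a 5s warning and a 10s critical threshold. A small sketch of that evaluation, thresholds taken from the alert text:

```python
def lag_state(lag: float, warn: float = 5.0, crit: float = 10.0) -> str:
    """Map a sustained-lag sample (seconds) to an Icinga-style state.

    Thresholds from the alert text: CRITICAL when lag >= 10, WARNING when
    lag >= 5 (db2168 going CRITICAL at exactly "10 ge 10" confirms >=).
    """
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"


# The s7 samples seen in the log above:
for sample in (12.8, 10.0, 13.4, 3.2, 3.0, 1.8, 2.8):
    print(f"{sample:>5} -> {lag_state(sample)}")
```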