[02:31:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:34] <jinxer-wm>	 FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:06:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:34] <jinxer-wm>	 FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:05:56] <elukey>	 hello folks!
[08:06:10] <elukey>	 I'd need to re-run the provision cookbook for dbproxy102[8,9]
[08:06:13] <elukey>	 for https://phabricator.wikimedia.org/T376121
[08:06:24] <elukey>	 it requires a reboot, so the hosts need to be depooled
[08:06:31] <elukey>	 lemme know if you have time in the next few days
[08:06:42] <arnaudb>	 elukey: checking rn, hold on
[08:08:44] <arnaudb>	 none of them are leader elukey so you are good to go, please let me know when they are rebooted! (https://github.com/wikimedia/operations-dns/blob/09ac50c967553d8cc47aa9c908fae2f4a78a92bb/templates/wmnet#L47-L54)
[08:09:28] <elukey>	 arnaudb: o/ thanks! So IIUC I can just proceed without specific precaution
[08:09:50] <arnaudb>	 indeed!
[08:10:27] <elukey>	 super, proceeding now with 1028
[08:50:21] <elukey>	 all done!
[08:51:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:23] <arnaudb>	 everything looks green, haproxy has been reloaded "just in case"
[09:11:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:26] <Emperor>	 I've silenced the puppet failure alerts for the 3 ms backends with dud disks (and created/updated DC tickets as appropriate)
[09:19:05] <Emperor>	 rclone errors were us losing two races with admin-deletion.
[09:21:56] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:30] <Emperor>	 ^-- those are just systemd-timedated exploding if there are any filesystems that say EIO (!)
[10:46:58] <elukey>	 Emperor: o/ - not sure if you already went through your list of gerrit changes while you were away, but I wanted to highlight https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380
[10:47:20] <elukey>	 worked nicely, nothing bad to expect but you should be aware
[10:48:37] <Emperor>	 that was the change adding the docker account to the exemption list for rate-limiting? Yeah, I saw that somewhere in a deluge of email. Beyond a slight twitchiness that this means the docker registry is more easily able to DoS the ms cluster, that seems OK
[10:48:55] <Emperor>	 (and that that would be Bad (TM) )
[10:49:06] <elukey>	 yeah I know :(
[10:49:46] <elukey>	 the other thing worth mentioning is that docker-registry-upstream deprecated support for swift; the next version out (not sure exactly when) works only with S3 and other backends
[10:50:02] <elukey>	 so probably we'll have to migrate the registry to APU at some point in the future
[10:50:17] <elukey>	 (so we'll be off ms entirely)
[10:52:46] <Emperor>	 APU ?
[10:53:18] <Emperor>	 (do you mean apus, the new ceph-based cluster, or something else?)
[10:53:43] <elukey>	 ah sorry I thought it was named apu, not apus
[10:54:06] <elukey>	 the last name that I heard was moss, not sure if the scope is the same or not
[10:54:15] <Emperor>	 NP. That will likely need some capacity planning work, so worth feeding into the appropriate annual planning / budget conversation
[10:54:52] <elukey>	 I am working on cleaning up the registry's swift footprint a bit, at the moment it is around 6TB
[10:55:02] <elukey>	 but we never really cleaned up anything from the beginning of time
[10:55:10] <Emperor>	 :sadpanda:
[11:22:34] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:23:32] <Emperor>	 that's a silence expiring, I'll extend it.
[12:00:48] <jinxer-wm>	 FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2149:9104 has too large replication lag (11m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2149&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[12:05:48] <jinxer-wm>	 RESOLVED: MysqlReplicationLagPtHeartbeat: MySQL instance db2149:9104 has too large replication lag (12m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2149&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[13:11:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.33.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:18] <Emperor>	 that looks like a disk has died; I'll get to it after our meeting
[16:03:11] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.33.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:38:41] <Amir1>	 zuwiki geo_tags {1: ['db2205 (codfw master)'], 48434: ['db1189 (eqiad master)', ....]
[19:38:44] <Amir1>	 how