[02:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:06:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:05:56] hello folks!
[08:06:10] I'd need to re-run the provision cookbook for dbproxy102[8,9]
[08:06:13] for https://phabricator.wikimedia.org/T376121
[08:06:24] it requires a reboot, so the hosts need to be depooled
[08:06:31] lemme know if you have time during the next days
[08:06:42] elukey: checking rn, hold on
[08:08:44] none of them are leader elukey so you are good to go, please let me know when they are rebooted! (https://github.com/wikimedia/operations-dns/blob/09ac50c967553d8cc47aa9c908fae2f4a78a92bb/templates/wmnet#L47-L54)
[08:09:28] arnaudb: o/ thanks! So IIUC I can just proceed without specific precaution
[08:09:50] indeed!
[08:10:27] super, proceeding now with 1028
[08:50:21] all done!
[08:51:56] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:23] everything looks green, haproxy has been reloaded "just in case"
[09:11:56] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:26] I've silenced the puppet failure alerts for the 3 ms backends with dud disks (and created/updated DC tickets as appropriate)
[09:19:05] rclone errors were us losing two races with admin-deletion.
[09:21:56] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be1075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:30] ^-- those are just systemd-timedated exploding if there are any filesystems that say EIO (!)
[10:46:58] Emperor: o/ - not sure if you already went through your list of gerrit changes while you were away, but I wanted to highlight https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380
[10:47:20] it worked nicely, nothing bad to expect, but you should be aware
[10:48:37] that was the change adding the docker account to the exemption list for rate-limiting? Yeah, I saw that somewhere in a deluge of email. Beyond a slight twitchiness that this means the docker registry is more easily able to DoS the ms cluster, that seems OK
[10:48:55] (and that that would be Bad (TM) )
[10:49:06] yeah I know :(
[10:49:46] the other thing worth mentioning is that docker-registry upstream deprecated support for swift; the next version out (not sure exactly when) works only with S3 and other backends
[10:50:02] so probably we'll have to migrate the registry to APU at some point in the future
[10:50:17] (so we'll be off ms entirely)
[10:52:46] APU ?
[10:53:18] (do you mean apus, the new ceph-based cluster, or something else?)
[10:53:43] ah sorry I thought it was named apu, not apus
[10:54:06] the last name that I heard was moss, not sure if the scope is the same or not
[10:54:15] NP. That will likely need some capacity planning work, so worth feeding into the appropriate annual planning / budget conversation
[10:54:52] I am working on cleaning up the registry's swift footprint a bit, at the moment it is around 6TB
[10:55:02] but we never really cleaned up anything from the beginning of time
[10:55:10] :sadpanda:
[11:22:34] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:23:32] that's a silence expiring, I'll extend it.
[12:00:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2149:9104 has too large replication lag (11m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2149&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[12:05:48] RESOLVED: MysqlReplicationLagPtHeartbeat: MySQL instance db2149:9104 has too large replication lag (12m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2149&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[13:11:56] FIRING: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.33.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:18] that looks like a disk has died; I'll get to it after our meeting
[16:03:11] RESOLVED: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.33.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:38:41] zuwiki geo_tags {1: ['db2205 (codfw master)'], 48434: ['db1189 (eqiad master)', ....]
[19:38:44] how