[00:57:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s4.service on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:11:51] are those alerts from db1150 expected?
[07:13:17] (I dunno if it'll help, but my toot about our vacancy has got a reasonable amount of boosting)
[07:15:03] yes
[07:15:11] I am "fixing it"
[07:16:16] I also mentioned we should get rid of predictive disk failure, because provisioning makes it fire
[07:22:25] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:52] jynus: sorry to be a bore; I know you've previously reviewed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190674 but would you mind having another look, please? I'm now leaving ms-be1088 out of the rings so e.lukey can do some boot-testing on it
[08:42:09] checking
[08:43:57] lgtm
[08:44:51] TY :)
[09:27:51] zabe: do you want to announce the rev_sha1 thing or should I?
[09:35:41] Amir1: Do you mean on the cloud mailing list?
[09:35:47] yes
[09:37:13] I can do it
[09:37:27] go for it
[09:37:39] How long do we want to give folks to migrate away?
[09:37:45] three weeks?
[09:37:52] sounds good
[09:38:10] Thank you!
[10:40:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:43:57] (/Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdf]/Exec[mkfs-/dev/sdf1]/unless) Check "xfs_admin -l /dev/sdf1" exceeded timeout
[10:45:02] (said command returns almost immediately now)
[10:50:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:22] that's interesting, since it's still running (and this time it's xfs_db -x -p xfs_admin -r -c label /dev/sdi1 that seems stuck)
[10:52:19] If that doesn't sort itself out soon, I'll reboot it.
[10:56:51] going for lunch; I will merge 1192501 when I come back
[10:57:58] sort> it didn't. Rebooting.
[14:12:50] federico3: can you run your schema change on eqiad hosts? T401906
[14:12:51] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:13:38] actually let me just run it with replication on a couple of them
[14:13:52] sure
[14:14:09] want to try using the wrapper perhaps?
[14:15:15] no, it's a different thing; when eqiad is depooled, I can just run the alter on the master of the dc with replication
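A minimal sketch of what the T401906 alter could look like, run on a master with replication enabled as described above. The table and column names come from the task title, but the exact DDL, the chosen default value, and the host/database names are all assumptions for illustration:

  #!/bin/bash
  # Hypothetical sketch only: the DDL is inferred from the task title,
  # and the host/database names below are made-up placeholders.
  wiki_db="enwiki"                    # placeholder database
  master="db-master.example.test"     # placeholder master host
  sql="ALTER TABLE abuse_filter_log
         ALTER COLUMN afl_ip SET DEFAULT '',
         ALTER COLUMN afl_ip_hex DROP DEFAULT;"
  # Running on the master of the depooled DC lets the change
  # replicate down to the replicas, per the discussion above.
  mysql -h "$master" "$wiki_db" -e "$sql"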
[14:15:37] but not right now actually, I have to go. Feel free to start the script (which is safer) on the eqiad hosts for now
[14:15:44] I'll try to get to it tomorrow
[14:16:01] ok
[15:43:20] I'd like a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192575 please, to make newly ordered ms-be nodes use our ms-be EFI preseed setup.
[15:59:10] looking
[16:01:41] why don't we unroll these regexps 😭
[16:08:59] I think they're globs not regexes
[16:44:09] FTR, we caused a spike in sessionstore GET requests for ~2 hours (4-5x the normal value)
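On the globs-vs-regexes exchange above: globs match literally apart from a few wildcards and are implicitly anchored, while regexes treat more characters as metacharacters and need explicit anchors. A minimal bash sketch of the difference (the hostname and patterns are made up):

  #!/bin/bash
  host="ms-be1090"   # made-up hostname
  # Glob match: '*' matches any run of characters; the pattern is
  # implicitly anchored to the whole string.
  if [[ $host == ms-be1* ]]; then echo "glob match"; fi
  # Regex match: '=~' uses POSIX extended regexes, so '.' and '*' are
  # metacharacters and anchoring must be written out.
  if [[ $host =~ ^ms-be1[0-9]+$ ]]; then echo "regex match"; fi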