[04:43:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:58:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:36] federico3: you working with db1176, right?
[08:05:51] yes
[08:07:55] got it, thanks
[09:36:19] Oh. ms-be1081 has decided it has no disk controller any more
[09:38:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1176:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:48] FIRING: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:43:10] is anyone working on es1033? It's been alerting for a while now
[10:43:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1154:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:44:19] okay, both sanitarium hosts are failing
[10:44:20] Emperor: It was used to clone another host, I just fixed it. Also the puppet errors are expected as gerrit is down
[10:44:33] Amir1: because of gerrit being down
[10:44:36] ah cuz gerrit
[10:47:31] ta
[10:53:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1154:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:23:48] RESOLVED: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:18:10] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:01] Amir1: I've rebooted all the ms backends now (other than the various broken ones); want to pick a day for me to do the frontends?
[13:54:33] Sure, let me see how far the deletions have gone
[13:57:54] so on codfw, you can go ahead with anything besides 2009, 2010, 2011. They are not running anything
[13:58:12] on eqiad, it's up to 1014 (1015 and onwards should be okay)
[13:58:30] the good thing is eqiad will finish soon, it's 80% done from what I'm seeing
[13:58:39] Amir1: Ideally, I'd like to just lazily run the roll-restart cookbook and do them all in one fell swoop
[13:58:42] next week should be fine for eqiad
[13:59:29] codfw would be later next week
[13:59:48] I think that's a reasonable timescale; you OK to ping me when ready for me to go for each DC?
[14:00:10] sounds good to me
[14:01:06] OTOH, would it be okay if we put the mw auth creds in all frontends temporarily? It makes my life much easier when I'm starting new batches
[14:01:11] (especially after restart)
[14:01:35] so I don't have to manually copy the creds into five extra hosts every time
[14:01:54] uhm what are the alerts on maps-test...?
[14:02:04] we can revert it once the deletions are done
[14:02:20] It'd be a bit of a pain puppet-wise, because there should only be one stats reporter host per cluster (and the creds only get copied to the stats reporter host)
[14:03:47] ah :(((
[14:04:16] sorry
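(Editor's note: the roll-restart cookbook discussed above automates a one-host-at-a-time restart of the swift frontends. A toy Python sketch of that pattern follows; the host names, the unit name, and the plain ssh invocation are illustrative assumptions, not the cookbook's actual implementation.)

```python
# Toy sketch of the rolling-restart pattern the cookbook automates:
# restart one frontend at a time, pausing between hosts so the
# cluster absorbs each restart before the next one begins.
import subprocess
import time

# Hypothetical frontend host list and unit name, for illustration only.
FRONTENDS = [f"ms-fe10{i:02d}.eqiad.wmnet" for i in range(9, 13)]
UNIT = "swift-proxy.service"

for host in FRONTENDS:
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", UNIT],
        check=True,  # abort the roll if any single restart fails
    )
    time.sleep(60)  # let the proxy rejoin before touching the next host
```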
[14:43:10] FIRING: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on es2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:40] if anyone experiences a moment of boredom, I'm looking for a sanity check of: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195276 (but there is no hurry whatsoever)
[14:54:22] RESOLVED: [2x] SystemdUnitFailed: ferm.service on restbase2030:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:28] thanks Emperor
[15:46:55] Can I get a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195769? Not complicated, but it applies to db servers
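(Editor's note: every alert link in this log is a query against the Alertmanager behind alerts.wikimedia.org. A minimal sketch of pulling the same firing alerts programmatically via the Alertmanager v2 API; the API host below is a placeholder, since the public site is a dashboard UI, and the label names are assumptions.)

```python
# Minimal sketch: list firing SystemdUnitFailed alerts through the
# Alertmanager v2 API. Host and label names are assumptions.
import requests

ALERTMANAGER = "https://alertmanager.example.org"  # hypothetical API host

resp = requests.get(
    f"{ALERTMANAGER}/api/v2/alerts",
    params={"filter": 'alertname="SystemdUnitFailed"'},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    labels = alert["labels"]
    # e.g. "es2054:9100 prometheus-mysqld-exporter.service active"
    print(labels.get("instance", "?"), labels.get("name", "?"),
          alert["status"]["state"])
```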