[04:43:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:58:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:36] federico3: you working with db1176, right?
[08:05:51] yes
[08:07:55] got it, thanks
[09:36:19] Oh. ms-be1081 has decided it has no disk controller any more
[09:38:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1176:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:48] FIRING: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:43:10] is anyone working on es1033? It's been alerting for a while now
[10:43:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1154:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:44:19] okay, both sanitarium hosts are failing
[10:44:20] Emperor: It was used to clone another host, I just fixed it. Also the puppet errors are expected as gerrit is down
[10:44:33] Amir1: because of gerrit being down
[10:44:36] ah cuz gerrit
[10:47:31] ta
[10:53:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1154:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:23:48] RESOLVED: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:18:10] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:01] Amir1: I've rebooted all the ms backends now (other than the various broken ones); want to pick a day for me to do the frontends?
[13:54:33] Sure, let me see how far the deletions have gone
[13:57:54] so on codfw, you can go ahead with anything besides 2009, 2010, 2011. They are not running anything
[13:58:12] on eqiad, it's up to 1014 (1015 and onwards should be okay)
[13:58:30] the good thing is eqiad will finish soon, it's 80% done from what I'm seeing
[13:58:39] Amir1: Ideally, I'd like to just lazily run the roll-restart cookbook and do them all in one fell swoop
[13:58:42] next week should be fine for eqiad
[13:59:29] codfw would be later next week
[13:59:48] I think that's a reasonable timescale; you OK to ping me when ready for me to go for each DC?
[14:00:10] sounds good to me
[14:01:06] OTOH, would it be okay if we put the mw auth creds in all frontends temporarily? It makes my life much easier when I'm starting new batches
[14:01:11] (especially after restart)
[14:01:35] so I don't have to manually copy the creds into five extra hosts every time
[14:01:54] uhm what are the alerts on maps-test...?
[14:02:04] we can revert it once the deletions are done
[14:02:20] It'd be a bit of a pain puppet-wise, because there should only be one stats reporter host per cluster (and the creds only get copied to the stats reporter host)
[14:03:47] ah :(((
[14:04:16] sorry
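(Editor's note: the roll-restart cookbook discussed above automates a one-host-at-a-time restart of the swift frontends. A toy Python sketch of that pattern follows; the host names, the unit name, and the plain ssh invocation are illustrative assumptions, not the cookbook's actual implementation.)

```python
# Toy sketch of the rolling-restart pattern the cookbook automates:
# restart one frontend at a time, pausing between hosts so the
# cluster absorbs each restart before the next one begins.
import subprocess
import time

# Hypothetical frontend host list and unit name, for illustration only.
FRONTENDS = [f"ms-fe10{i:02d}.eqiad.wmnet" for i in range(9, 13)]
UNIT = "swift-proxy.service"

for host in FRONTENDS:
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", UNIT],
        check=True,  # abort the roll if any single restart fails
    )
    time.sleep(60)  # let the proxy rejoin before touching the next host
```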
[14:43:10] FIRING: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on es2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:40] if anyone experiences a moment of boredom, I'm looking for a sanity check of: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195276 (but there is no hurry whatsoever)
[14:54:22] RESOLVED: [2x] SystemdUnitFailed: ferm.service on restbase2030:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:28] thanks Emperor
[15:46:55] Can I get a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195769? Not complicated, but it applies to db servers
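(Editor's note: every alert link in this log is a query against the Alertmanager behind alerts.wikimedia.org. A minimal sketch of pulling the same firing alerts programmatically via the Alertmanager v2 API; the API host below is a placeholder, since the public site is a dashboard UI, and the label names are assumptions.)

```python
# Minimal sketch: list firing SystemdUnitFailed alerts through the
# Alertmanager v2 API. Host and label names are assumptions.
import requests

ALERTMANAGER = "https://alertmanager.example.org"  # hypothetical API host

resp = requests.get(
    f"{ALERTMANAGER}/api/v2/alerts",
    params={"filter": 'alertname="SystemdUnitFailed"'},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    labels = alert["labels"]
    # e.g. "es2054:9100 prometheus-mysqld-exporter.service active"
    print(labels.get("instance", "?"), labels.get("name", "?"),
          alert["status"]["state"])
```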