[02:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:53] thanos-be2003 has lost sde
[08:32:40] T378800 filed, alert silenced
[08:32:41] T378800: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800
[11:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:14] urandom: just to follow up on the issues you had with aqs1022
[14:11:46] I've not been able to reproduce what you found with the additional dns names starting with b-xxx, c-xxx, etc.
[14:12:07] it may also be a quirk of the underlying cause of the other problems, which is how we import the original IP into Netbox
[14:12:21] ok
[14:12:23] we have a fix for that, and I think once it's deployed the "additional IPs" script should work ok
[14:12:35] or at least, on our test netbox server with the fix in place, it runs smoothly
[14:12:43] have you any more nodes to add in the near future?
[14:12:56] I guess the bad thing about a script like this is that so much time passes between uses
[14:13:10] haha, probably not until next quarter?
[14:13:22] case in point :)
[14:14:14] ok. well, I think we shouldn't hit it next time, but maybe ping me in advance when you're doing it; it will be good to know we have the root-cause issue resolved, and to see the script running in anger
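
For context on the script being discussed (and the letter-prefixed names that come up next): a minimal sketch of what an "additional IPs" script of this kind might look like, assuming pynetbox and stock NetBox IPAM fields. The URL, token, function name, and naive allocation scheme are illustrative placeholders, not the actual WMF tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch only; not the real WMF script.

Given a host that already has a primary IPv4 in Netbox, register N more
addresses in the same prefix, each with a letter-prefixed DNS name
(a-, b-, c-, ...) as discussed in this conversation.
"""
import ipaddress
import string

import pynetbox

NETBOX_URL = "https://netbox.example.org"  # assumption: your NetBox URL
API_TOKEN = "..."                          # assumption: an API token

def add_additional_ips(hostname: str, count: int) -> None:
    nb = pynetbox.api(NETBOX_URL, token=API_TOKEN)
    device = nb.dcim.devices.get(name=hostname)
    if device is None or device.primary_ip4 is None:
        # The failure mode mentioned in this conversation: if the imported
        # primary IP is wrong or missing, everything downstream misbehaves.
        raise RuntimeError(f"{hostname}: no usable primary IPv4 in Netbox")

    primary = ipaddress.ip_interface(device.primary_ip4.address)
    for offset, letter in enumerate(string.ascii_lowercase[:count], start=1):
        # Naive allocation for illustration: take the next addresses after
        # the primary IP within the same prefix.
        nb.ipam.ip_addresses.create(
            address=f"{primary.ip + offset}/{primary.network.prefixlen}",
            dns_name=f"{letter}-{hostname}",  # in reality this would be a FQDN
            status="active",
        )

if __name__ == "__main__":
    add_additional_ips("aqs1022", 3)  # would register a-, b-, c-aqs1022
```

At roughly 30 lines this matches the size mentioned below; the interesting part is not the loop but the precondition on the primary IP, which is where the real script tripped.
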
[14:14:34] having the ids not start at [a-] wouldn't be the end of the world
[14:14:52] technically it shouldn't matter; the tooling we use tries to make no assumptions
[14:15:37] I'm sure something does, and so you could argue it would be a feature to flush those out with some that violate the convention
[14:16:12] yeah, we should just keep it the same
[14:16:17] but... they have to start somewhere, and 'a' is the element of least surprise :)
[14:16:33] tbh that whole script is very short, 30 lines or something, so it really shouldn't give us many problems
[14:16:46] * urandom knocks on wood
[14:16:50] I didn't account for the input data (the existing primary IP) being wrong, though; hopefully we've ironed that out
[14:17:29] cool, thanks for following up
[14:17:46] and sorry for pinging you while you were OoO :)
[14:22:07] ah, no problem, I had my machine in case anything was needed in an emergency :)
[15:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:36] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:16:01] we have no database people working today, so I'm going to downtime db2239 until 08:15 UTC on Monday
[16:16:29] AFAICT from T373579 that node isn't properly in prod yet, and there's no point in us getting alerted about it all weekend.
[16:16:35] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
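
A downtime like the one at 16:16:01 can be expressed directly as an Alertmanager silence; in practice it would more likely go through the alerts.wikimedia.org UI or existing SRE tooling, but here is a minimal sketch against the Alertmanager v2 HTTP API, assuming `requests`. The Alertmanager URL and author are placeholders, and the end time is computed as the next Monday at 08:15 UTC.

```python
#!/usr/bin/env python3
"""Illustrative sketch: silence all alerts for db2239:9100 until Monday
08:15 UTC via Alertmanager's v2 API. URL and author are assumptions."""
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "https://alertmanager.example.org"  # assumption: API base URL

now = datetime.now(timezone.utc)
days_to_monday = (7 - now.weekday()) % 7 or 7      # Monday is weekday 0
end = (now + timedelta(days=days_to_monday)).replace(
    hour=8, minute=15, second=0, microsecond=0
)

silence = {
    # Match every alert whose instance label is db2239:9100
    "matchers": [
        {"name": "instance", "value": "db2239:9100",
         "isRegex": False, "isEqual": True},
    ],
    "startsAt": now.isoformat(),
    "endsAt": end.isoformat(),
    "createdBy": "sre-oncall",  # assumption: placeholder author
    "comment": "db2239 not yet in production (T373579); silenced for the weekend",
}

resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("created silence:", resp.json()["silenceID"])
```

Matching on the `instance` label covers both the SystemdUnitFailed alerts above and anything else the host fires over the weekend, which is the point of downtiming rather than silencing one alertname.
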