[02:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:53] thanos-be2003 has lost sde
[08:32:40] T378800 filed, alert silenced
[08:32:41] T378800: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800
[11:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:14] urandom: just to follow up on the issues you had with aqs1022
[14:11:46] I've not been able to reproduce what you found with the additional dns names starting with b-xxx, c-xxx, etc.
[14:12:07] it may also be a quirk of the underlying cause of the other problems, which is how we import the original IP into Netbox
[14:12:21] ok
[14:12:23] we have a fix for that, and I think once it's deployed the "additional IPs" script should work ok
[14:12:35] or at least, on our test netbox server with the fix in place, it runs smoothly
[14:12:43] have you any more nodes to add in the near future?
[14:12:56] I guess the bad thing about a script like this is that so much time passes between uses
[14:13:10] haha, probably not until next quarter?
[14:13:22] case in point :)
[14:14:14] ok. well, I think we shouldn't hit it next time, but maybe ping me in advance when you're doing it; it will be good to know we have the root-cause issue resolved, and to see the script running in anger
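
For context on the script being discussed (and the letter-prefixed names that come up next): a minimal sketch of what an "additional IPs" script of this kind might look like, assuming pynetbox and stock NetBox IPAM fields. The URL, token, function name, and naive allocation scheme are illustrative placeholders, not the actual WMF tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch only; not the real WMF script.

Given a host that already has a primary IPv4 in Netbox, register N more
addresses in the same prefix, each with a letter-prefixed DNS name
(a-, b-, c-, ...) as discussed in this conversation.
"""
import ipaddress
import string

import pynetbox

NETBOX_URL = "https://netbox.example.org"  # assumption: your NetBox URL
API_TOKEN = "..."                          # assumption: an API token

def add_additional_ips(hostname: str, count: int) -> None:
    nb = pynetbox.api(NETBOX_URL, token=API_TOKEN)
    device = nb.dcim.devices.get(name=hostname)
    if device is None or device.primary_ip4 is None:
        # The failure mode mentioned in this conversation: if the imported
        # primary IP is wrong or missing, everything downstream misbehaves.
        raise RuntimeError(f"{hostname}: no usable primary IPv4 in Netbox")

    primary = ipaddress.ip_interface(device.primary_ip4.address)
    for offset, letter in enumerate(string.ascii_lowercase[:count], start=1):
        # Naive allocation for illustration: take the next addresses after
        # the primary IP within the same prefix.
        nb.ipam.ip_addresses.create(
            address=f"{primary.ip + offset}/{primary.network.prefixlen}",
            dns_name=f"{letter}-{hostname}",  # in reality this would be a FQDN
            status="active",
        )

if __name__ == "__main__":
    add_additional_ips("aqs1022", 3)  # would register a-, b-, c-aqs1022
```

At roughly 30 lines this matches the size mentioned below; the interesting part is not the loop but the precondition on the primary IP, which is where the real script tripped.
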
[14:14:34] having the ids not start at [a-] wouldn't be the end of the world
[14:14:52] technically it shouldn't matter; the tooling we use tries to make no assumptions
[14:15:37] I'm sure something does, and so you could argue it would be a feature to flush those out with some that violate the convention
[14:16:12] yeah, we should just keep it the same
[14:16:17] but... they have to start somewhere, and 'a' is the element of least surprise :)
[14:16:33] tbh that whole script is very short, 30 lines or something, so it really shouldn't give us many problems
[14:16:46] * urandom knocks on wood
[14:16:50] I didn't account for the input data (the existing primary IP) being wrong, though; hopefully we've ironed that out
[14:17:29] cool, thanks for following up
[14:17:46] and sorry for pinging you while you were OoO :)
[14:22:07] ah, no problem, I had my machine in case anything was needed in an emergency :)
[15:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:36] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:16:01] we have no database people working today, so I'm going to downtime db2239 until 08:15 UTC on Monday
[16:16:29] AFAICT from T373579 that node isn't properly in prod yet, and there's no point in us getting alerted about it all weekend.
[16:16:35] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
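
A downtime like the one at 16:16:01 can be expressed directly as an Alertmanager silence; in practice it would more likely go through the alerts.wikimedia.org UI or existing SRE tooling, but here is a minimal sketch against the Alertmanager v2 HTTP API, assuming `requests`. The Alertmanager URL and author are placeholders, and the end time is computed as the next Monday at 08:15 UTC.

```python
#!/usr/bin/env python3
"""Illustrative sketch: silence all alerts for db2239:9100 until Monday
08:15 UTC via Alertmanager's v2 API. URL and author are assumptions."""
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "https://alertmanager.example.org"  # assumption: API base URL

now = datetime.now(timezone.utc)
days_to_monday = (7 - now.weekday()) % 7 or 7      # Monday is weekday 0
end = (now + timedelta(days=days_to_monday)).replace(
    hour=8, minute=15, second=0, microsecond=0
)

silence = {
    # Match every alert whose instance label is db2239:9100
    "matchers": [
        {"name": "instance", "value": "db2239:9100",
         "isRegex": False, "isEqual": True},
    ],
    "startsAt": now.isoformat(),
    "endsAt": end.isoformat(),
    "createdBy": "sre-oncall",  # assumption: placeholder author
    "comment": "db2239 not yet in production (T373579); silenced for the weekend",
}

resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("created silence:", resp.json()["silenceID"])
```

Matching on the `instance` label covers both the SystemdUnitFailed alerts above and anything else the host fires over the weekend, which is the point of downtiming rather than silencing one alertname.
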