[00:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[00:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[04:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[08:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[08:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:09:42] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10771974 (ayounsi) Another oddity is that subscribing to `/components/component/transceiver/physical-channels/channel/state` a...
[12:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[13:04:10] CAS-SSO, Bitu, Infrastructure-Foundations: Logging out of idm does not log me out - https://phabricator.wikimedia.org/T392350#10772465 (Arendpieter)
[13:37:51] XioNoX: thanks for adding me on the email re: Ufinet. wanted to check, was that mostly for awareness or you wanted me to follow up with the list of IPs?
[13:43:55] sukhe: just awareness
[13:44:09] thanks!
[14:17:54] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772800 (elukey) The host was reimaged on the 5th afaics: ` 2024-05-06 09:10:50,421 marostegui 595479 [DEBUG _cookbook.py:511 in main] Executing cookbook sre.hosts.reimage with args: ['--os', 'bookworm', '-t', 'T363...
[14:30:33] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772825 (taavi) Anything left to do here?
[14:32:01] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10772843 (ayounsi) p:Triage→Medium
[14:32:17] netops, Hiddenparma, Infrastructure-Foundations: Reduce the steps needed to deploy hiddenparma - https://phabricator.wikimedia.org/T382268#10772846 (CDanis) p:Triage→Low
[14:33:33] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772850 (jhathaway) Thanks @elukey, perhaps puppetserver needs to be reloaded to pick up the revoke, and this didn't happen until more recently?
[14:43:40] if anybody has time for a quick patch https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1139407
[14:45:48] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772886 (ssingh) >>! In T379927#10772825, @taavi wrote: > Anything left to do here? Nothing on the prod DNS hosts side; if you k...
[15:31:21] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10773131 (elukey) One thing that I see is that the reimage failed: ` 2024-05-06 10:31:27,368 marostegui 595479 [INFO _log.py:125 in log_task_end] END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178...
[15:36:22] cookbooks question: I have a service that which after restart and because the host will always be pooled is in state of warning
[15:36:44] this fails the cookbook though and that's not what I want. I can manually ACK the service I guess and that's fine but I want it to be fully automated and skip a particular service
[15:37:07] any clean ways of achieving that? I can't seem to find anything relevant in https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html
[15:37:29] other than ACKing the alert, which like I shared, is not ideal (and also doesn't work preemptively?)
[15:37:42] sukhe: I didn't get the first part, what is the issue with the service?
[15:38:13] elukey: the service will always be returning a WARNING, until the host is pooled, which happens in the end of the cookbook run
[15:38:16] > The check was skipped as the host is not pooled for authdns-update
[15:38:34] but the cookbook fails as it says that not all services recovered:
[15:38:37] [4/15, retrying in 12.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal..check' raised: Not all services are recovered: dns2005:NTP peers and stratum check,check if authdns-update was run after a change was merged to operations/dns.git
[15:38:47] (ignore the NTP one, that recovers when the host syncs)
[15:39:13] so now until I ACK the "check if authdns-update was run" alert, the cookbook will fail
[15:39:35] makes sense yes
[15:40:20] and is the cookbook dns admin.py?
[15:40:46] no sorry, one of the roll restarts probably
[15:41:03] yep
[15:41:13] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/dns/roll-reboot.py
[15:41:21] okok checking
[15:41:57] I guess I can use run_icinga_command to manually ACK the service?
[15:42:11] that way the cookbook will proceed with skip_acked
[15:43:50] so in the parent class there is icinga_hosts.wait_for_optimal(skip_acked=True)
[15:43:56] that is consistent with what you are seeing
[15:43:59] yeah
[15:44:43] but that is simply the "action" definition, in theory you can override it
[15:44:54] yeah...
[15:45:02] I am just setting that via Icinga manually which does not work for me :P
[15:45:15] yes that's for sure :D
[15:45:27] so perhaps I was thinking, I should set it via the cookbook perhaps using https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts.run_icinga_command?
[15:45:34] or, do you know of a better way?
[15:46:28] what I was trying to suggest is that you can override "action" in the dns cookbook and skip the wait_for_optimal
[15:46:31] as possible road
[15:46:55] but that would mean that it would skip other services too?
[15:47:01] exactly yes
[15:47:32] I never acked a service programmatically via spicerack, but in theory it should work
[15:47:50] you override action, ack the services that you want, and then call super.action()
[15:48:35] yeah I guess
[15:49:07] it should be clean, and if it works we can try to add a specific/handy helper in spicerack
[15:49:20] so people don't have to come up with a more complex api call
[15:51:03] yeah thanks, I think that would be nice. I guess I can simply not return a WARNING for this check if the host is not pooled but I don't want to
[15:51:10] or another alternative could be to modify wait_for_optimal in spicerack, with a list of services to skip
[15:51:21] thanks elukey! I will check the relevant Icinga command and then try to pass that
[15:51:28] elukey: that would be ideal yep
[15:51:37] in fact, that's what I was looking for in the docs
[15:51:43] I for some reason imagined it already was there
[15:52:06] it doesn't seem possible at the moment (please note: premium support from Riccardo is currently not available, this is "Standard" support :D)
[15:52:25] but it should be easy to do
[15:53:24] haha
[15:53:32] you are premium support too <3
[15:53:37] I wish :D
[15:53:38] <3
[15:53:46] how soon do you need it? I can try to work on it tomorrow
[15:54:20] no worries at all
[15:54:32] this is not urgent and I can certainly ACK
[15:54:35] perhaps later we can do it
[15:55:38] sure sure, I think it could be a first good step, then we can work on spicerack
[16:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[19:05:11] SRE-tools, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848 (ssingh) NEW
[19:05:32] SRE-tools, Infrastructure-Foundations, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10774076 (ssingh) p:Triage→Low
[20:24:37] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[20:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
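
Note on the 15:36-15:55 thread (and T392848): the alternative floated at 15:51:10, a list of services for wait_for_optimal to skip, could look roughly like the sketch below. This is only an illustration of the filtering logic, not Spicerack code: `ServiceStatus`, `unrecovered_services` and the `skip_services` parameter are hypothetical names made up for this note, and the third service in the sample data is invented. Only `skip_acked` and the two real service names come from the log above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical stand-in for one Icinga service state (not a Spicerack class)."""

    name: str
    optimal: bool
    acked: bool = False


def unrecovered_services(
    statuses: Iterable[ServiceStatus],
    skip_acked: bool = False,
    skip_services: Iterable[str] = (),  # hypothetical parameter, not in Spicerack
) -> List[str]:
    """Return the names of services that should still block the cookbook."""
    skipped = set(skip_services)
    return [
        status.name
        for status in statuses
        if not status.optimal
        and not (skip_acked and status.acked)
        and status.name not in skipped
    ]


# Service name as it appears in the error message quoted at 15:38:37.
AUTHDNS_CHECK = (
    "check if authdns-update was run after a change was merged to operations/dns.git"
)

statuses = [
    ServiceStatus("NTP peers and stratum check", optimal=True),
    ServiceStatus(AUTHDNS_CHECK, optimal=False),
    ServiceStatus("some other check", optimal=False),  # invented for the example
]

# Without a skip list the authdns check blocks; with it, only the other one does.
blocking_default = unrecovered_services(statuses)
blocking_with_skip = unrecovered_services(statuses, skip_services=[AUTHDNS_CHECK])
```

Compared with acking the service from the cookbook and relying on skip_acked, a skip list would leave the Icinga state untouched and keep every other service blocking as before, which is the concern raised at 15:46:55.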