[08:10:13] CAS-SSO, Infrastructure-Foundations, SRE: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10414909 (SLyngshede-WMF) ` cas.theme.default-theme-name=wikimedia # WebAuthN cas.authn.mfa.web-authn.core.application-id=https://idp-test.wikimedia.org cas.authn.mfa.web-authn...
[09:44:59] netops, Infrastructure-Foundations, SRE: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10415035 (fgiunchedi) Thank you all for looking into this -- let's indeed see how `3m` (or larger) goes and if that is satisfactory! >>! In T382396#10413490,...
[09:53:20] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10415055 (Volans) a:Volans
[12:46:19] hi, we currently see DHCP requests in codfw and eqiad (multiple hosts) failing/not receiving offers - is anything being worked on that could cause this?
[12:46:22] cc jelto
[14:14:01] maybe it works sometimes...I've one server successfully PXE booting but then it hangs in di network config, asking me to enter nameservers
[14:17:03] jayme: you mean something that changed since yesterday/
[14:17:04] ?
[14:17:22] or during the day...first 4 hosts were fine
[14:17:41] I haven't touched anything related
[14:17:59] retrying auto network config in di made it work...so I'm suspecting something DHCP related
[14:18:52] being flaky or maybe there's a rogue dhcp server
[14:19:30] can you give me a hostname to check?
[14:26:31] wikikube-worker2055 and 2056 were also stuck two times but third time works now (at least the installer is doing something after dhcp and networking)
[14:31:50] jelto: if I grep for wikikube-worker2055's MAC on install2004's dhcp logs I see the first DHCPDISCOVER at Dec 19 14:20:18, that resulted in a lease shortly after and then a new one (d-i?) at 14:22:27
[14:31:54] nothing prior to that
[14:32:37] topranks: any of your today's work could be related?
[14:34:02] volans: no not at all
[14:35:47] 14:20 was the third try, I tried a first time at 11:02 and there 2055 and 2056 did not reimage properly. Both were stuck at "Unable to verify that the host rebooted into the new OS, it might still be in the Debian installer, please verify manually with: sudo install-console wikikube-worker2055.codfw.wmnet"
[14:36:27] and they were?
[14:36:30] in the old OS
[14:39:11] yes at least in the serial console I had a login prompt "wikikube-worker2056 login:". 2055 is reimaging currently and it's looking good, I'll retry wikikube-worker2056 next
[14:39:45] so it seems it never pxe-booted
[14:39:49] doesn't seem a dhcp problem to me
[14:39:58] there is no evidence a dhcp request was ever made
[14:52:30] I did not follow the serial console from the beginning, just when I noticed something is off. And then I was greeted by the login prompt of the existing debian installation.
[15:01:10] Mail, Infrastructure-Foundations, Observability-Metrics, Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867#10415850 (fgiunchedi) Open→Invalid No longer valid I think, also MXes now use postfix
[15:03:36] netops, Infrastructure-Foundations, Observability-Metrics, SRE: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155#10415860 (fgiunchedi) Open→Resolved a:fgiunchedi Done in {T370506}
[15:41:31] I'll go read the docs, but just in case someone happens to know off the top of their head, how do I check from a cookbook whether a given hostname is in puppetdb?
[15:43:52] kamila_: literally whether it exists?
[15:44:20] yes cdanis
[15:44:40] I'm writing a roll-reimage cookbook and I want to check whether I need to reimage with --new
[15:45:29] I think perhaps cookbooks.sre.puppet.get_puppet_version()
[15:48:40] I'll look, thanks! at least one of that and a puppetdb query will work :D appreciated!
[15:49:30] that effectively *is* a puppetdb query, but via a microservice
[15:49:35] right, that's perfect
[15:49:37] thanks a ton!
[15:49:56] BYOSLO (bring your own session-like object)
[15:50:40] kamila_: if you do a remote query with the fqdn that's querying puppetdb
[15:50:49] that's how the reimage cookbook detects it
[15:50:53] self.remote_host = self.remote.query(self.fqdn)
[15:51:14] right volans, that's what I meant
[15:51:15] thanks!
[15:51:39] see the reimage around lines 163-175
[15:52:01] we can also change the reimage and if --force is set
[15:52:06] bypass the ask confirmation there
[15:52:09] and DTRT
[15:52:12] as it does already
[15:52:24] basically forcing --new or removing it based on its existence
[15:52:40] the latter seems the best approach IMHO
[15:53:03] oh, right, that's also an option I suppose
[15:53:05] no need to duplicate code
[15:53:08] thanks volans! yeah
[15:55:20] should have asked before I moved my code around so I could make it conditional :D
[16:10:55] hm, looks like --force doesn't skip that confirmation?
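For readers following along, the idea discussed above (query the host by FQDN to learn whether PuppetDB knows it, then set or drop --new from that, skipping the confirmation when --force is given) might be sketched roughly as below. This is a self-contained illustration, not the reimage cookbook's code: `HostNotFound`, `host_in_puppetdb`, `resolve_new_flag`, `fake_query` and the `query`/`ask` callables are all hypothetical names, and the sketch assumes the query raises an error when no host matches.

```python
from typing import Callable, Set


class HostNotFound(Exception):
    """Hypothetical error: the query matched no host."""


def host_in_puppetdb(query: Callable[[str], object], fqdn: str) -> bool:
    """Return True if querying for the FQDN finds the host, False otherwise.

    `query` stands in for whatever performs the PuppetDB-backed lookup;
    this sketch assumes it raises HostNotFound when nothing matches.
    """
    try:
        query(fqdn)
    except HostNotFound:
        return False
    return True


def resolve_new_flag(
    exists: bool, new: bool, force: bool, ask: Callable[[str], None]
) -> bool:
    """Return the effective value of --new for a reimage.

    A host that is not yet known needs --new; a known host must not use it.
    If the flag already matches, it is returned unchanged. On a mismatch,
    --force corrects it silently; otherwise `ask` is called first (assumed
    to raise if the operator declines) and the flag is then corrected.
    """
    wanted = not exists
    if new == wanted:
        return new
    if not force:
        state = "already" if exists else "not"
        ask(f"Host is {state} in PuppetDB, --new will be set to {wanted}. Proceed?")
    return wanted


# Tiny in-memory stand-in for the real lookup, for demonstration only.
KNOWN_HOSTS: Set[str] = {"wikikube-worker2055.codfw.wmnet"}


def fake_query(fqdn: str) -> str:
    if fqdn not in KNOWN_HOSTS:
        raise HostNotFound(fqdn)
    return fqdn


if __name__ == "__main__":
    fqdn = "wikikube-worker2055.codfw.wmnet"
    exists = host_in_puppetdb(fake_query, fqdn)
    # With force=True the mismatch (--new on a known host) is fixed without asking.
    print(resolve_new_flag(exists, new=True, force=True, ask=print))
```

Keeping the decision in one function that takes the confirmation prompt as a parameter matches the point made in the channel: the prompt helper itself does not need to know about --force, the caller decides whether to invoke it.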
[16:11:19] volans| we can also change the reimage and if --force is set
[16:11:22] volans| bypass the ask confirmation there
[16:11:23] ^^^
[16:12:06] oh, you meant I should patch it, not that it's already that way
[16:12:09] sure, I can do that :D
[16:12:13] indeed :)
[16:12:30] force was introduced to help less-controlled reimages so seems to fit your use case
[16:12:38] and seems logical to use it also for this behaviour
[16:15:57] yes, I thought you meant that it already does that because that would have been logical
[16:16:00] netops, Infrastructure-Foundations, SRE: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10416237 (cmooney) >>! In T382396#10415035, @fgiunchedi wrote: > Yes and that's almost always the case, my understanding though is that the samples may not alw...
[16:16:43] have a patch :D https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1105752
[16:18:07] {done}
[16:18:45] (I somehow expected ask_confirmation to be handling --force internally, which would actually be ugly since it is not passed a context '^^ I think the click library does that, but it also passes a context object)
[16:19:12] ty volans <3
[16:21:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518 (cmooney) NEW p:Triage→Low
[16:25:57] netops, Infrastructure-Foundations, ops-eqsin, SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519 (cmooney) NEW p:Triage→Low
[16:28:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10416342 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7fe2fd80-b4a4-43f7-ba5a-5238c44bbd7a) set by cmooney@cumin1002 for 30 days,...
[16:31:44] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:35:49] netops, Infrastructure-Foundations, ops-eqsin, SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#10416397 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=68d77968-a0dd-4bd1-94ad-66be8ab508c5) set by cmooney@cumin1002 for 30 days, 0:00:00 on 2...
[16:36:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:41:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[18:53:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[19:58:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[22:59:15] Mail, Infrastructure-Foundations, MediaWiki-extensions-BounceHandler: Valid email address was unconfirmed for temporary spam blacklisting - https://phabricator.wikimedia.org/T99444#10417605 (Mail.faluzes) I think [[ https://www.google.com/amp/www.iptvservice.site/?e=