[00:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[02:05:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[06:05:26] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:26] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[08:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[08:44:25] FIRING: SystemdUnitFailed: krb5-kdc.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:25] FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:25] FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:14] ^ krb1002 is in setup
[10:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:39:25] FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[12:27:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[12:44:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:56] I'm seeing a TLS cert error at https://puppetboard.wikimedia.org/report/db1178.eqiad.wmnet/2648b4e15246c9ba5bf24ad499312c438f1f2045 when db1178.eqiad.wmnet is trying to reach https://puppetserver1001.eqiad.wmnet:8140/puppet/... - curl-ing from the same host is not triggering cert errors at the moment - was there a transient issue with the cert perhaps?
[13:09:25] FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:25] FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:23:10] elukey: perhaps you know about the error above?
[14:24:58] federico3: I can check, but never seen it.. has the host been reimaged or similar recently?
[14:26:52] elukey: it's been up a long time but rebooted 10 days ago
[14:27:20] mmmm it seems to me that somehow puppetserver1001 may be in trouble responding for some reason, and db1178 gets the failures
[14:27:34] it is not consistent with all the puppet runs afaics
[14:28:19] nothing out of the ordinary from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1001&var-datasource=thanos&var-cluster=puppet&from=now-24h&to=now
[14:28:40] curl 'https://puppetserver1001.eqiad.wmnet:8140/' is successful from a cert PoV (getting 404)
[14:30:43] the first cert error seems to appear at 2025-04-17T00
[14:30:58] I'm happy to take a look as well federico3
[14:31:16] Hey Jesse o/
[14:31:20] please go ahead :)
[14:31:24] nod
[14:31:27] will do!
[14:31:54] I am wondering if this happens for other hosts as well, may be a sign of puppetserver reaching max capacity?
[14:32:07] could be
[14:32:10] (could it be that the certs have been rotated and the host is using an old CA cert?)
[14:36:07] we are not rotating certs that frequently IIRC, and I'd expect errors to happen only once in a while
[14:36:14] this one seems more consistent
[14:44:31] this is the hourly error count https://phabricator.wikimedia.org/P75448
[14:44:41] it appears to be only and issue with puppetserver1001.eqiad.wmnet
[14:44:44] *an
[14:45:08] with 1002 & 1003 it works fine
[15:30:44] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10765458 (Jgreen) Invalid→Resolved
[16:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:27:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[16:34:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[17:49:37] jhathaway: should i open a task to track this?
[17:50:26] sure, I think I have the cause figured out, but there are a few broken pieces
[18:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:23:12] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627 (jhathaway) NEW
[19:23:57] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766249 (jhathaway) p:Triage→Medium a:jhathaway
[19:24:43] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766252 (jhathaway) Only occurs on puppetserver1001.eqiad.wmnet, cert was revoked on April 14th: ` puppetserver-2025-04-14.0.log.gz:2025-04-14T07:26:35.169Z INFO [qtp1905171892-12616218] [p.p.certificate-authority] Rev...
[19:26:06] Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628 (jhathaway) NEW
[19:26:18] Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10766268 (jhathaway) p:Triage→Medium
[19:30:19] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629 (jhathaway) NEW
[19:30:27] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10766295 (jhathaway) p:Triage→Medium
[20:24:36] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[20:24:37] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:34:25] FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:50] Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637 (jhathaway) NEW
[21:21:25] Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766678 (jhathaway) p:Triage→Medium
[21:42:28] Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766729 (jhathaway) Open→Resolved I opened subtasks for the issues discovered when looking at this issue, the server certificate itself has been regenerated, however why the cert was revoked in the first plac...
[21:42:54] Puppet: Non-ca puppetservers do not check the CA certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766732 (jhathaway)
[22:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure