[00:20:47] Packaging, Abstract Wikipedia team, function-evaluator, Infrastructure-Foundations, Abstract Wikipedia Fix-It tasks: Package rustc from forky for wikimedia-bookworm so we can use it in an image like abstractwiki-rust - https://phabricator.wikimedia.org/T425341#11934039 (Jdforrester-WMF) ...
[02:51:10] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:32:25] Mail, Infrastructure-Foundations, Product Safety and Integrity, SRE, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11934418 (kostajh) Open→Resolved >>! In T426105#11933806, @jhathaway wrote: > Someone from Yahoo was kind enough to reach out to me...
[06:56:10] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:26] SRE-tools, Infrastructure-Foundations, SRE Observability: sre.kafka.roll-restart-reboot-brokers: command-config is not a recognized option - https://phabricator.wikimedia.org/T426639#11935096 (elukey) Open→Resolved a:elukey
[10:56:52] moritzm: o/ I am planning to decom pki1001 later on if you are ok (https://phabricator.wikimedia.org/T426739)
[10:57:19] sounds good!
[11:01:10] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:30:56] Hello. I'm having a persistent problem with a reimage. Has anyone got time to help me look at it, please? The ticket is T425653 and the host I'm trying to reimage at the moment is dse-k8s-wdqs-test1001.
[14:30:56] T425653: Add the wdqs::alternative nodes to the dse-k8s clusters - with a taint to avoid normal jobs being scheduled - https://phabricator.wikimedia.org/T425653
[14:32:43] The symptoms are that PXE boot works and the Debian installer completes successfully. Upon reboot, the cookbook cannot poll for the uptime.
[14:33:20] I also tried with a `--no-pxe` with the freshly installed bookworm host, but it crashes out quickly.
[14:34:23] Maybe this is something related to the fact that I renamed it from wdqs1028 to dse-k8s-wdqs-test1001 (with the `sre.hosts.rename` cookbook)
[14:46:50] btullis: o/
[14:46:55] sorry just seen it, lemme check
[14:47:16] Thanks. I'm just trying again with trixie, to see if it makes a difference.
[14:47:43] I have tried upgrading the NIC firmware and the BIOS, with no effect.
[14:47:49] just to understand - the host does boot in debian, but the cookbook doesn't recognize it?
[14:47:55] or it fails to boot?
[14:48:25] That's right. It boots after the installer has finished and rebooted.
[14:48:57] I believe that the host is pingable by its IP address.
[14:50:45] It's just running through partitioning at the moment in a trixie installer. It should reboot in a few minutes and we can see whether or not trixie fixes it.
[14:50:53] ok I am watching it via mgmt console
[14:58:01] It has pxe booted again.
[14:58:02] btullis: it is again in debian install
[14:58:05] yeah
[14:58:22] is it a new node?
[14:58:58] mmm no super old
[14:59:13] I have given a chassis power reset from the IPMI.
[14:59:21] Yes, 7 years old.
[14:59:42] Hardware given to wikidata platform team to help them test wdqs_v2.
[15:00:50] OK, now booting locally.
[15:01:10] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:08] It's now in the same state as before. Host has booted, but the cookbook can't detect the reboot.
[15:06:48] checking
[15:10:03] I think that the double reimage messes up things, I cannot root-login from neither install-console nor mgmt's root promt
[15:10:08] *prompt
[15:11:22] so cumin cannot really do much when verifying if the host is up
[15:12:00] btullis: I think that we should probably upgrade the idrac and bios firmwares, re-provision and retry
[15:12:19] chances are that the host is super old and it is not compatible with what we do now in reimage
[15:13:09] Yes, I noticed that the root password didn't work. That's consistent across attempts. I have already upgraded the BIOS, but I'll try the iDrac too.
[15:13:55] yep I highly suggest that, and after it you can run the provision cookbook with --no-dhcp --no-users --no-switch
[15:19:53] jhathaway: nice find on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289089 again. seems like we have a few other classes named 'ferm' as well, so wondering if we should change the tag to be more specific instead of renaming those
[15:22:27] yeah, I started looking at that
[15:22:54] I think the two that matter are, ./modules/role/manifests/mariadb/ferm.pp and ./modules/profile/manifests/mariadb/ferm.pp
[15:22:59] but there could be others
[15:23:13] the role one is trivial, I have a patch I can push
[15:23:27] the profile one is a define type, and would require more careful effort
[15:24:57] it also still has a ferm::service, which was not converted to a firewall service, but I'm not sure why
[15:25:44] `git ls-files | grep '/ferm\.pp$'` shows those and modules/profile/manifests/docker/ferm.pp modules/profile/manifests/firewall/log/ferm.pp as well
[15:26:14] yeah, but those two are ferm specific, i.e. they don't use the wrappers, so I think they would not have any effect
[15:27:01] yeah, although they may be hit by the same thing if/when they're converted, since this is not an obvious thing to fail on
[15:27:19] but if you have a patch for the mariadb ones, let's do that
[15:28:16] I'll push my role patch, the `define profile::mariadb::ferm` will take some more work
[15:28:39] Amir and myself are currently in the process of moving the mariadb firewall definitions to firewall::service, if there's anything which should get expedited, let me know
[15:30:58] moritzm: oh great, are you working specifically on `define profile::mariadb::ferm`?
[15:32:32] the last patches were for role::mariadb::ferm ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289171 )
[15:32:43] which is grossly misnamed and should really be a profile...
[15:33:16] IIRC profile::mariadb::ferm is mostly used on the dbproxies and current focus was on the main DB nodes
[15:33:22] but we can aim at fixing these next
[15:33:35] ok, for the role one we just need to rename
[15:34:05] for profile::mariadb::ferm we need to rename and update the existing ferm::service to a firewall::service
[15:34:17] happy to give that a go and push a patch for review
[15:34:36] moritzm: basically the issue we have is that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211651 does not work if there is a class/define name ending with ::ferm that uses the firewall::* wrappers, thanks to puppet tags being cursed
[15:36:42] jhathaway: sounds good, happy to review (and please also add Amir)
[15:36:58] I'm not even sure what we use these tags for?
[15:37:25] this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211651
[15:37:42] switching between nftables and ferm using a collector
[15:38:12] ah, ok
[16:37:20] FIRING: PfwCoreBGPDown: ...
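The tag collision discussed in the 15:19 to 15:38 exchange can be illustrated with a small sketch. Puppet automatically tags every resource with the full name of its containing class or defined type and with each namespace segment of that name, so anything declared inside `role::mariadb::ferm` also carries the tag `ferm`. The Python below is not Puppet code and is not taken from the patches linked in the log; the helper names `automatic_tags` and `collides_with` are hypothetical, the first four class names are derived from the file paths mentioned in the log, and `profile::example::service` is an invented non-colliding name. It only inspects names, so it cannot tell whether a class actually uses the firewall::* wrappers.

```python
def automatic_tags(qualified_name: str) -> set[str]:
    """Return the tags Puppet derives from a class or defined type name.

    Puppet adds the full qualified name plus every namespace segment.
    (It also adds the resource type, which is ignored in this sketch.)
    """
    name = qualified_name.lower().lstrip(":")
    segments = [segment for segment in name.split("::") if segment]
    return {name, *segments}


def collides_with(qualified_name: str, tag: str) -> bool:
    """True if a collector filtering on `tag` would match resources
    declared inside the given class or defined type."""
    return tag in automatic_tags(qualified_name)


# Names derived from the paths in the log, plus one hypothetical name.
names = [
    "role::mariadb::ferm",
    "profile::mariadb::ferm",
    "profile::docker::ferm",
    "profile::firewall::log::ferm",
    "profile::example::service",  # hypothetical, for contrast
]

colliding = [name for name in names if collides_with(name, "ferm")]
print(colliding)
```

All four `...::ferm` names are reported, which matches the concern raised at 15:27:01: classes that do not use the wrappers today could still hit the same collision once they are converted.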
[16:37:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[16:57:20] RESOLVED: PfwCoreBGPDown: ...
[16:57:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[17:32:20] FIRING: PfwCoreBGPDown: ...
[17:32:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[17:37:11] RESOLVED: PfwCoreBGPDown: ...
[17:37:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[17:40:11] FIRING: PfwCoreBGPDown: ...
[17:40:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[17:50:20] RESOLVED: PfwCoreBGPDown: ...
[17:50:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[19:01:10] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:06:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:11:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:22:11] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:25:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:57:33] Mail, collaboration-services, Infrastructure-Foundations, VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11938383 (jhathaway) The issue is that on `mx-in{1001,2001}.wikimed...
[23:06:10] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed