[00:04:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074287 (owner: 10TrainBranchBot) [00:24:14] (03PS1) 10DErenrich: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074303 [00:25:25] (03CR) 10CI reject: [V:04-1] Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074303 (owner: 10DErenrich) [00:30:09] (03PS1) 10DErenrich: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 [00:30:44] (03Abandoned) 10DErenrich: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074303 (owner: 10DErenrich) [00:36:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:37:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [00:41:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:02:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:10:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10163187 (10phaultfinder) [02:38:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:47:02] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406#10163233 (10jeremyb) 05Open→03Declined put this on hold for now. Will come back if I have a more solid plan for how I would use it regularly. [03:51:31] (03Abandoned) 10Wangombe: Update reference to ElasticSearchTtmServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342) (owner: 10Wangombe) [04:04:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:05:16] (03CR) 10Gergő Tisza: [C:03+1] ClosedWikiProvider: Support canAlwaysAutocreate option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) 
[05:21:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:23:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:23:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 5.004 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:24:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 5.012 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240920T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:25:07] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10163278 (10phaultfinder) [06:27:59] (03CR) 10DCausse: [C:03+1] ClosedWikiProvider: Support canAlwaysAutocreate option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [06:46:52] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 
198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:46:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240920T0700) [07:16:25] FIRING: SystemdUnitFailed: load-dcatap-weekly.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:05] !log rolling upgrade of purged on magru, drmrs, esams and eqiad - T334078 [07:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:09] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [07:22:58] (03PS6) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [07:23:31] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [07:26:25] FIRING: [2x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:21] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [07:28:13] !log jelto@cumin1002 END (FAIL) - Cookbook 
sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [07:30:40] (03PS1) 10Muehlenhoff: icinga: Enable profile::auto_restarts::service for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1074358 (https://phabricator.wikimedia.org/T135991) [07:30:50] (03PS2) 10Muehlenhoff: icinga: Enable profile::auto_restarts::service for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1074358 (https://phabricator.wikimedia.org/T135991) [07:31:25] FIRING: [3x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:33:01] !log rebalance ganeti group D following the various switch maintenances T370630 [07:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:05] T370630: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630 [07:34:00] (03PS7) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [07:41:23] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [07:41:25] FIRING: [6x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:11] (03PS1) 10Filippo Giunchedi: thanos: trim 5m retention to 15w [puppet] - 10https://gerrit.wikimedia.org/r/1074360 (https://phabricator.wikimedia.org/T351927) [07:48:36] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: Enable 
profile::auto_restarts::service for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1074358 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:49:33] (03CR) 10Arnaudb: "afair @ladsgroup@gmail.com has a good order for schema updates which can be reused here" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [07:50:56] !log T375085 testing mtail 3.0.9 using debian testing package on centrallog2002 [07:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:00] T375085: mtail 3.0.0~rc50-1+b6 leaks memory on centrallog2002 - https://phabricator.wikimedia.org/T375085 [07:56:25] FIRING: [7x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:16] (03CR) 10Muehlenhoff: "Looks good, a few comments/nits inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:01:25] FIRING: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:12:03] (03PS8) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [08:12:12] (03CR) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:14:27] (03PS1) 10Alexandros Kosiaris: Add various .wikimedia.org domains to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074365 (https://phabricator.wikimedia.org/T374997) [08:14:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:01] (03CR) 10Jelto: "Thanks for mentioning this parameter, this seems to be what I want." [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [08:26:02] (03CR) 10Muehlenhoff: ferm: Use ferm-status to start ferm on diffs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:27:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10163445 (10aborrero) thanks! 
[08:29:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10163458 (10aborrero) [08:30:27] (03CR) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:31:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:31:56] PROBLEM - MegaRAID on es1022 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:31:58] ACKNOWLEDGEMENT - MegaRAID on es1022 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T375257 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:32:04] (03PS1) 10Filippo Giunchedi: grafana: remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1074368 (https://phabricator.wikimedia.org/T321808) [08:32:35] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:32:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257 (10ops-monitoring-bot) 03NEW [08:34:52] (03PS1) 10Filippo Giunchedi: librenms: remove obsolete checks [puppet] - 10https://gerrit.wikimedia.org/r/1074369 (https://phabricator.wikimedia.org/T321808) [08:36:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10163495 (10ABran-WMF) a:03ABran-WMF [08:37:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool es1022 - T375257', diff saved to https://phabricator.wikimedia.org/P69377 and previous config saved to /var/cache/conftool/dbconfig/20240920-083722-arnaudb.json [08:37:26] T375257: Degraded 
RAID on es1022 - https://phabricator.wikimedia.org/T375257 [08:38:59] (03CR) 10JMeybohm: [C:03+2] ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:39:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10163503 (10ABran-WMF) 05Open→03In progress p:05Triage→03High This instance has been depooled, it's ready to be handled [08:47:38] (03CR) 10Giuseppe Lavagetto: [C:03+1] "+1 because this fixes the problem, but maybe we can come up with a better solution for favicons and serve them statically?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074365 (https://phabricator.wikimedia.org/T374997) (owner: 10Alexandros Kosiaris) [08:47:56] (03PS1) 10JMeybohm: ferm: Fix systemd override to not append ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/1074371 (https://phabricator.wikimedia.org/T374366) [08:49:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074371 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:51:22] (03CR) 10JMeybohm: [C:03+2] ferm: Fix systemd override to not append ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/1074371 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:53:48] RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:54:59] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [08:55:41] (03CR) 10Muehlenhoff: [C:03+2] Switch chartmuseum to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1073414 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:03:03] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10163581 (10MoritzMuehlenhoff) [09:05:07] (03CR) 10Ayounsi: [C:03+1] "thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1074369 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [09:11:10] 06SRE, 06Infrastructure-Foundations, 10netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10163608 (10cmooney) I've added the network containers as discussed on the previous task in Netbox now: https://netbox.wikimedia.org/ipam/prefixes/?within_include=2a02%3Aec80%... [09:14:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me, although I can't make any sense of the CI failure. Do we need to maybe merge the envoy spec change ahead of the rest of " [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [09:19:37] (03CR) 10Muehlenhoff: [C:03+2] idp::build: Remove duplicate repository config [puppet] - 10https://gerrit.wikimedia.org/r/1073788 (owner: 10Muehlenhoff) [09:32:07] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [09:34:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [09:35:54] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad [09:38:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-e4-eqiad [09:39:33] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - 
https://phabricator.wikimedia.org/T372943#10163697 (10ABran-WMF) >>! In T372943#10... [09:40:01] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-f4-eqiad [09:42:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-f4-eqiad [09:42:54] 06SRE, 06Infrastructure-Foundations, 06serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10163701 (10JMeybohm) This came up during {T374366} - currently it is not possible to disable `defs_from_etcd` in a clean way. [09:47:54] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10163710 (10ayounsi) 05Open→03Resolved Certs regenerated, so we're good for the next 12 months. Hopefully we will setup automation by then :) [09:50:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [09:54:10] (03PS1) 10Muehlenhoff: openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 [09:57:43] (03PS1) 10Btullis: Enable the rclone backup schedule on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) [10:01:08] (03PS1) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [10:03:11] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e1-eqiad [10:03:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-eqiad [10:06:58] (03PS2) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [10:07:07] !log 
ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e2-eqiad [10:07:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-eqiad [10:07:48] (03PS3) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [10:07:57] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e3-eqiad [10:08:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-eqiad [10:08:50] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f2-eqiad [10:08:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-eqiad [10:10:59] (03PS4) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [10:11:32] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f3-eqiad [10:11:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-eqiad [10:16:41] (03PS3) 10Slyngshede: Grant permissions: Hookup LDAP permission granting. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1073162 [10:18:58] (03PS2) 10Btullis: Enable the rclone backup schedule on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) [10:19:11] (03CR) 10Filippo Giunchedi: [C:03+2] librenms: remove obsolete checks [puppet] - 10https://gerrit.wikimedia.org/r/1074369 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [10:19:17] (03PS2) 10Filippo Giunchedi: librenms: remove obsolete checks [puppet] - 10https://gerrit.wikimedia.org/r/1074369 (https://phabricator.wikimedia.org/T321808) [10:19:27] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:33] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] librenms: remove obsolete checks [puppet] - 10https://gerrit.wikimedia.org/r/1074369 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [10:19:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4065/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [10:23:01] (03CR) 10CI reject: [V:04-1] sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [10:23:12] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:22] (03PS5) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [10:33:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:09] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10163774 (10phaultfinder) [10:41:31] (03PS1) 10Btullis: Add a datahubsearch cluster and assign the correct hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074389 (https://phabricator.wikimedia.org/T374932) [10:42:14] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4066/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074389 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [10:45:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1073162 (owner: 10Slyngshede) [10:46:55] (03CR) 10Stevemunene: [C:03+1] Add a datahubsearch cluster and assign the correct hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074389 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [10:51:34] (03PS1) 10Btullis: Add an airflow cluster and assign relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) [10:54:18] (03CR) 10Santiago Faci: hieradata::services_proxy::envoy.yaml: fix duplicated port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [10:55:46] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 7 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240920T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240920T1100). Please do the needful. 
[11:09:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1073460 (owner: 10Slyngshede) [11:11:34] !log installing nano updates from Bullseye point update [11:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:24] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10163797 (10MoritzMuehlenhoff) [11:18:18] !log installing links2 updates from Bullseye point release [11:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:02] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10163800 (10MoritzMuehlenhoff) [11:30:57] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10163804 (10MoritzMuehlenhoff) [11:37:02] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I also tested that this works on Buster to Bookworm." 
[software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1060789 (https://phabricator.wikimedia.org/T216832) (owner: 10Hashar) [11:40:33] (03CR) 10Muehlenhoff: [C:03+2] Do not use a login shell when dropping privileges [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1060789 (https://phabricator.wikimedia.org/T216832) (owner: 10Hashar) [11:46:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [11:46:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [11:46:26] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10163828 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ae34a38f-1cf6-4321-bfa2-45c744f8ff06) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host... [11:49:02] (03CR) 10Arnaudb: [C:03+1] sre.switchdc.databases.prepare: add check [cookbooks] - 10https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:51:51] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2129.codfw.wmnet - https://phabricator.wikimedia.org/T375207#10163837 (10ABran-WMF) sorry! forgot to be explicit about it! this host is ready to be handled! [11:57:26] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2424.codfw.wmnet [11:58:05] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2424.codfw.wmnet [11:58:52] (03PS1) 10Santiago Faci: Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) [11:59:20] (03PS9) 10Gmodena: dse-k8s-service: add values for dumps2 job. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) [11:59:33] (03CR) 10CI reject: [V:04-1] Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [12:01:25] FIRING: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:22] (03PS1) 10Muehlenhoff: Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1074397 (https://phabricator.wikimedia.org/T216832) [12:05:36] (03CR) 10Gmodena: Declare stream 'mediawiki.dump.revision_history.reconcile.v1.rc0' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [12:07:04] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2425.codfw.wmnet with reason: reimage [12:07:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2425.codfw.wmnet with reason: reimage [12:08:40] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1074397 (https://phabricator.wikimedia.org/T216832) (owner: 10Muehlenhoff) [12:09:55] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:05] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:11:32] PROBLEM - MD RAID on 
mw2424 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:11:34] ACKNOWLEDGEMENT - MD RAID on mw2424 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T375270 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:11:41] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2424 - https://phabricator.wikimedia.org/T375270 (10ops-monitoring-bot) 03NEW [12:14:42] FIRING: KubernetesRsyslogDown: rsyslog on mw2424:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2424 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:14:55] FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:58] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:17:59] go [12:18:29] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: trim 5m retention to 15w [puppet] - 10https://gerrit.wikimedia.org/r/1074360 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [12:18:30] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:19:55] FIRING: SystemdUnitFailed: prometheus-puppet-agent-stats.service on mw2424:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:49] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:29:48] PROBLEM - Host mw2424 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:55] RESOLVED: SystemdUnitFailed: prometheus-puppet-agent-stats.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:32] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:54] (03CR) 10Filippo Giunchedi: [C:03+2] grafana: remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1074368 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [12:32:18] RECOVERY - Host mw2424 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms [12:32:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 377, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:34:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw2424:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2424 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:35:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10163923 (10MoritzMuehlenhoff) Ganeti row C and D have been rebalanced. [12:35:29] (03PS1) 10JMeybohm: ferm: Use ferm-status to restart ferm on wikikube-staging [puppet] - 10https://gerrit.wikimedia.org/r/1074404 (https://phabricator.wikimedia.org/T374366) [12:36:22] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:37:57] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2425.codfw.wmnet with reason: reimage [12:38:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2425.codfw.wmnet with reason: reimage [12:38:50] PROBLEM - Host mw2424 is DOWN: PING CRITICAL - Packet loss = 100% [12:38:52] (03PS1) 10JMeybohm: ferm: Make reload via ferm-status the default [puppet] - 10https://gerrit.wikimedia.org/r/1074405 (https://phabricator.wikimedia.org/T374366) [12:39:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:42:33] FIRING: KubernetesCalicoDown: mw2424.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2424.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:42:54] RECOVERY - Host mw2424 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [12:43:22] (03PS1) 10Filippo Giunchedi: uwsgi: remove icinga-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) [12:43:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:32] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 377, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:59] (03CR) 10CI reject: [V:04-1] uwsgi: remove icinga-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) (owner: 10Filippo Giunchedi) [12:45:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2424.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:47:22] (03PS1) 10Effie Mouzeli: kubernetes: mw2313 -> wikikube-worker2124 [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) [12:47:33] RESOLVED: KubernetesCalicoDown: mw2424.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2424.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:48:06] (03PS2) 10Effie Mouzeli: kubernetes: rename mw2313 -> wikikube-worker2124 [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) [12:48:34] (03PS2) 10Filippo Giunchedi: uwsgi: remove icinga-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) [12:48:45] (03CR) 10EoghanGaffney: [C:03+2] contint: remove jdk-11 packages [puppet] - 
10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [12:50:46] (03PS1) 10JMeybohm: kafka::broker: Add the external-services DNS name to the certs [puppet] - 10https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) [12:51:10] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [12:51:32] RECOVERY - MD RAID on mw2424 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:53:22] (03PS3) 10Effie Mouzeli: kubernetes: rename mw2313 -> wikikube-worker2124 [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) [12:55:37] (03CR) 10Filippo Giunchedi: "I ran an audit for uwsgi::app hosts with monitoring enabled:" [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) (owner: 10Filippo Giunchedi) [12:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:58:02] (03CR) 10Kamila Součková: [C:03+1] "+1 except see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [12:58:03] (03CR) 10Alexandros Kosiaris: [C:04-1] "Change LGTM, commit message is wrong. Amend and merge!" [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [12:59:05] (03CR) 10Hashar: "I have `apt purge` the JDK 11 packages." 
[puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [13:01:14] (03PS4) 10Effie Mouzeli: kubernetes: rename mw2424 -> wikikube-worker2124 [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) [13:01:42] (03CR) 10Effie Mouzeli: kubernetes: rename mw2424 -> wikikube-worker2124 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [13:02:38] (03PS1) 10Fabfur: Renamed log field for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) [13:09:24] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw2425.codfw.wmnet with reason: reimage [13:09:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw2425.codfw.wmnet with reason: reimage [13:09:30] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10164005 (10Clement_Goubert) [13:12:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074404 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [13:13:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10164013 (10hashar) a:03hashar [13:14:43] (03CR) 10JMeybohm: [C:03+1] kubernetes: rename mw2424 -> wikikube-worker2124 [puppet] - 10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [13:15:13] (03CR) 10JMeybohm: [C:03+2] ferm: Use ferm-status to restart ferm on wikikube-staging [puppet] - 10https://gerrit.wikimedia.org/r/1074404 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [13:17:17] (03PS1) 10Bartosz Dziewoński: SSO domain 
shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) [13:17:58] (03CR) 10CI reject: [V:04-1] SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:20:51] !log uploaded debmonitor-client 0.4.0-3 for buster/bullseye/bookworm to apt.wikimedia.org T216832 [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:55] T216832: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 [13:21:43] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10164040 (10MoritzMuehlenhoff) I've uploaded updated debs with the patch, will be rolled out next week. [13:33:59] (03CR) 10Ssingh: [C:03+1] geo-maps: update map default to list codfw first [dns] - 10https://gerrit.wikimedia.org/r/1073899 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [13:34:43] (03CR) 10Ssingh: [C:03+1] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [13:35:09] (03CR) 10Ssingh: [C:03+1] wmnet: update CNAME record for maintenance host to codfw [dns] - 10https://gerrit.wikimedia.org/r/1073898 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [13:35:17] (03PS1) 10Muehlenhoff: bacula::storage: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074427 [13:35:52] (03PS2) 10Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) [13:36:33] (03CR) 10CI reject: [V:04-1] SSO domain shouldn't have a mobile version [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:38:19] (03CR) 10Ssingh: [C:03+1] "(not verified the list of hostnames from dbconfig)" [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [13:38:25] (03PS3) 10Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) [13:39:51] (03PS1) 10Muehlenhoff: mw_rc_irc: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074429 [13:40:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074427 (owner: 10Muehlenhoff) [13:40:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074429 (owner: 10Muehlenhoff) [13:47:02] (03PS1) 10Btullis: Add a presto cluster and assign the relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) [13:47:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4068/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [13:59:46] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10164107 (10MoritzMuehlenhoff) [14:03:57] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@6010568]: automoderator config dag T375062 [14:04:02] T375062: ETL pipeline to update Automoderator config (weekly) - https://phabricator.wikimedia.org/T375062 [14:04:43] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@6010568]: automoderator config dag T375062 (duration: 01m 47s) [14:08:25] !log Running `foreachwiki maintenance/fixAutoblockLogTitles.php` on a 
tmux session for T373929 [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:30] T373929: TypeError: Argument 1 passed to LogFormatter::makeUserLink() must implement interface MediaWiki\User\UserIdentity, bool given - https://phabricator.wikimedia.org/T373929 [14:11:17] (03PS3) 10Bking: WIP: clean up Elastic runbook links post-doc rewrite [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) [14:11:33] (03PS4) 10Bking: Clean up Elastic runbook links post-doc rewrite [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) [14:12:27] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10164116 (10ssingh) Thanks @Scott_French for documenting this by filing this task! I had a brief chat about this with @volans yesterday on IRC.... [14:13:54] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10164117 (10Jhancock.wm) @Dwisehaupt I ran out of cycles yesterday and didn't get to this task. so sorry about that. Can you downtime it again? and I... 
[14:20:37] !log Finished running`foreachwiki maintenance/fixAutoblockLogTitles.php` on a tmux session for T373929 [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:41] T373929: TypeError: Argument 1 passed to LogFormatter::makeUserLink() must implement interface MediaWiki\User\UserIdentity, bool given - https://phabricator.wikimedia.org/T373929 [14:22:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2129.codfw.wmnet - https://phabricator.wikimedia.org/T375207#10164141 (10Jhancock.wm) 05Open→03Resolved [14:23:56] (03PS1) 10JHathaway: vrts_aliases: add a basic safeguard [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) [14:26:05] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1074434 [14:26:37] (03PS4) 10Santiago Faci: Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) [14:26:56] (03CR) 10CI reject: [V:04-1] Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [14:29:31] (03PS16) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [14:30:17] (03PS5) 10Santiago Faci: Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) [14:30:50] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10164160 (10Dwisehaupt) @jhancock.wm It's all ready for you. 
[14:31:58] (03CR) 10Ebernhardson: [C:03+1] "Links all go to an existing place and look relevant to the alert" [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) (owner: 10Bking) [14:32:04] (03CR) 10Ebernhardson: [C:03+2] Clean up Elastic runbook links post-doc rewrite [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) (owner: 10Bking) [14:32:30] (03CR) 10Ssingh: "Looks good! Let's test it out on a cumin host to see a dry-run and then let's finalize it:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:33:15] (03Merged) 10jenkins-bot: Clean up Elastic runbook links post-doc rewrite [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) (owner: 10Bking) [14:34:19] (03PS2) 10CDobbins: sre.dns.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1074266 (https://phabricator.wikimedia.org/T375232) [14:37:25] (03PS17) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [14:38:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:53] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10164179 (10Jhancock.wm) @Dwisehaupt moved and powering up. Let me know if anything looks amiss. 
[14:41:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10164180 (10Jhancock.wm) [14:41:38] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10164181 (10MoritzMuehlenhoff) [14:44:42] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:12] (03PS1) 10Ayounsi: Add gRPC checks to network devices [puppet] - 10https://gerrit.wikimedia.org/r/1074435 [14:49:59] (03PS8) 10Herron: thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 [14:49:59] (03CR) 10Herron: "makes sense thanos, updated this to point at localhost and also rebased on a patch to deploy the otel collector to titan hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [14:50:15] (03PS2) 10Herron: titan: add opentelemetry collector [puppet] - 10https://gerrit.wikimedia.org/r/1074434 [14:50:28] (03PS2) 10Ayounsi: Add gRPC checks to network devices [puppet] - 10https://gerrit.wikimedia.org/r/1074435 [14:50:33] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [14:51:14] (03CR) 10Herron: "*makes sense THANKS" [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [14:51:37] (03CR) 10Ssingh: "Test run looks good!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:55:04] (03CR) 10Ssingh: "See https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1073290/comments/4b104d3d_74611ca1 on merging this and the pdns-recursor cookb" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074266 (https://phabricator.wikimedia.org/T375232) (owner: 10CDobbins) [14:55:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10164221 (10hashar) I have upgraded the client on contint1002.wikimedia.org. The warning no more appears and I can see it upgraded in the Debmonitor web interf... [14:58:44] (03PS6) 10Santiago Faci: Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) [14:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:09] (03PS3) 10Ayounsi: Add gRPC checks to network devices [puppet] - 10https://gerrit.wikimedia.org/r/1074435 [15:01:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [15:06:35] will sign off, nothing to report for oncall! 
[15:06:40] (03PS4) 10Ayounsi: Add gRPC checks to network devices [puppet] - 10https://gerrit.wikimedia.org/r/1074435 [15:06:47] <_Gerges> sorry, I don't know where to raise this issue, I think this channel is the most appropriate https://usercontent.irccloud-cdn.com/file/1EpskAzL/2024-08-26-16-54-phabricator.wikimedia.org.png [15:06:59] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [15:07:10] (03Abandoned) 10Ayounsi: WIP: add gNMI (+cert) check for network devices [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [15:07:57] _Gerges: Possibly https://phabricator.wikimedia.org/T362401 [15:08:42] <_Gerges> Access Denied: Restricted Task [15:08:58] ah.. hrm.. I'll add a complaint on your behalf. [15:09:21] If you're willing, please send me a private message with your IP address for the record. [15:09:23] <_Gerges> Thank you [15:10:24] <_Gerges> The ip address is in the image :(, can my message be deleted? [15:11:01] (03CR) 10Ssingh: [C:03+1] "Just for clarity: this still requires an authdns-update run. The depool cookbook (sre.dns.admin) does not. I have updated the docs on the " [dns] - 10https://gerrit.wikimedia.org/r/1073899 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [15:11:34] I don’t think the message itself can be deleted :/ but maybe IRCCloud allows you to delete the image (no idea) [15:13:40] _Gerges: I notice that the timestamp on that screen shot shows Mon, 26 Aug 2024 13:52:32 GMT which is suspicious [15:13:53] Today is 20 Sept 2024. [15:15:10] _Gerges: you can delete it from irccloud's website [15:15:16] <_Gerges> I didn't notice, so could it be a problem with the cache? [15:15:38] I just wanna confirm that it's definitely an up-to-date screen shot [15:17:33] _Gerges: Confirming that the screen shot is not available via IRC anymore. 
[15:18:08] <_Gerges> I deleted it [15:19:25] <_Gerges> I will send you a new screenshot in private messages. [15:19:43] OK. I added a report to the phab ticket. [15:27:45] (03CR) 10CDanis: [C:03+1] titan: add opentelemetry collector [puppet] - 10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron) [15:29:21] (03CR) 10CDanis: [C:03+1] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [15:33:50] (03PS1) 10Hashar: jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 [15:34:26] (03CR) 10CI reject: [V:04-1] jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (owner: 10Hashar) [15:37:17] (03PS2) 10Hashar: jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) [15:38:09] (03CR) 10CI reject: [V:04-1] jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [15:44:06] (03PS3) 10Hashar: jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) [15:46:02] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [15:51:44] (03CR) 10Hashar: "The "duplication" comes from: https://gerrit.wikimedia.org/r/c/operations/puppet/+/884887/7..9/modules/jenkins/manifests/init.pp#b100" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:00:21] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10164333 (10Dwisehaupt) @Jhancock.wm pay-lb2001 looks good. Thanks. 
[16:01:17] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:22] T375261: Unblock stuck global rename of Abderazack Ahmat Annour - https://phabricator.wikimedia.org/T375261 [16:01:40] FIRING: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:46] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:11] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=frwiki --logwiki=metawiki 'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:28] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:50] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] !log T375261 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=wikidatawiki --logwiki=metawiki 
'Abderazack Ahmat Annour' 'Renamed user 5dcfa1a90396fed431e188c9cbe5af6a' [16:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:13] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10164355 (10VRiley-WMF) [16:08:46] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375130#10164353 (10VRiley-WMF) →14Duplicate dup:03T374897 [16:10:11] (03PS1) 10Hashar: contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) [16:16:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:16:34] (03CR) 10Hashar: "I don't think the Puppet compiler will work cause it would not take in account the parent change ( https://gerrit.wikimedia.org/r/c/operat" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:19:36] (03CR) 10Dzahn: contint: define component/ci only once (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:21:13] (03PS1) 10Herron: pyrra: liftwing-articlequery-latency invert response_code label [puppet] - 10https://gerrit.wikimedia.org/r/1074475 [16:23:30] (03PS3) 10JHathaway: WIP - puppet8: migrate "easy" puppet facts to structured facts [puppet] - 10https://gerrit.wikimedia.org/r/1074239 [16:23:55] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375281 (10phaultfinder) 03NEW [16:24:08] (03CR) 10Muehlenhoff: "Just use apt::package_from_component, it's the cleaner interface and doesn't have the problem with the duplicated reposirtory to begin wit" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:26:04] (03CR) 10Scott French: "Thank 
you both again for the discussion. I'm going to go ahead and merge this ahead of next week." [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [16:26:32] (03CR) 10Scott French: [C:03+2] sre.discovery: set timeout in raw dns.query.udp [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [16:33:44] (03CR) 10Hashar: "I guess yes, that would be a nicer approach than having to stick requires / class dependencies all across the manifests. In the parent I2" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:33:55] (03PS2) 10Hashar: contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) [16:34:36] (03CR) 10CI reject: [V:04-1] contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:34:45] (03CR) 10Dzahn: [C:03+1] "thanks :))" [puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [16:35:35] (03PS1) 10Dzahn: gerrit: include gerrit profile in insetup::gerrit for testing [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) [16:38:29] (03CR) 10Hashar: "Moritz points we should use the `apt::package_from_component` interface which does not suffer from the duplicate `apt::repository` definit" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [16:39:35] (03PS1) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [16:39:43] (03Merged) 10jenkins-bot: sre.discovery: set timeout in raw dns.query.udp [cookbooks] - 
10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [16:40:23] (03PS2) 10Herron: pyrra: liftwing-articlequery-latency invert response_code label [puppet] - 10https://gerrit.wikimedia.org/r/1074475 (https://phabricator.wikimedia.org/T375284) [16:40:24] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:40:39] (03PS2) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [16:41:28] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:42:56] (03PS3) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [16:43:21] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1064413/4071/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [16:43:41] (03PS3) 10Herron: pyrra: liftwing-articlequery-latency invert response_code label [puppet] - 10https://gerrit.wikimedia.org/r/1074475 (https://phabricator.wikimedia.org/T375284) [16:43:45] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:45:19] (03CR) 10Dzahn: [V:03+1 C:03+1] "reference is made to I5c9318727f10a562c93ee where this was added" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [16:46:24] (03CR) 10Dzahn: [C:03+2] "nowadays this can be removed because we have new gerrit and MINA versions -->
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064413" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [16:46:30] (03CR) 10Ssingh: "Makes sense, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [16:47:13] (03CR) 10Herron: [C:03+2] "going ahead with self-merge to unbreak the UI on this SLO" [puppet] - 10https://gerrit.wikimedia.org/r/1074475 (https://phabricator.wikimedia.org/T375284) (owner: 10Herron) [16:50:24] (03PS4) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [16:51:14] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:51:49] (03CR) 10Dzahn: [C:03+2] "we are on MINA APACHE-SSHD-2.12.0 and gerrit 3.10.0 as of today." [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [16:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:53:15] (03CR) 10Dzahn: [V:03+1 C:03+1] "we are on MINA APACHE-SSHD-2.12.0 and gerrit 3.10.0 as of today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [16:55:39] (03PS4) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [16:56:14] (03PS4) 10JHathaway: puppet8: migrate "easy" legacy puppet facts to structured facts [puppet] - 10https://gerrit.wikimedia.org/r/1074239 [16:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:43] (03PS5) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [16:57:53] (03PS1) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1074482 [16:58:03] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074482 (owner: 10JHathaway) [16:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:01:05] (03CR) 10Muehlenhoff: gerrit: fix todo from 2022, remove nist key setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [17:02:47] (03CR) 10Scott French: "Thanks for all the reviews, and for adding that clarification to the docs. 
Yeah, in retrospect, I can see how it might be confusing to rea" [dns] - 10https://gerrit.wikimedia.org/r/1073899 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [17:03:40] (03PS2) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1074482 [17:06:50] (03PS3) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1074482 [17:11:12] (03PS5) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [17:13:14] (03PS1) 10JHathaway: WIP - ci test [puppet] - 10https://gerrit.wikimedia.org/r/1074486 [17:13:30] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:15:54] (03PS6) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [17:15:58] (03CR) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [17:16:29] (03PS1) 10JHathaway: rake: squash git merge warning [puppet] - 10https://gerrit.wikimedia.org/r/1074487 [17:17:23] (03CR) 10JHathaway: [C:03+2] rake: squash git merge warning [puppet] - 10https://gerrit.wikimedia.org/r/1074487 (owner: 10JHathaway) [17:18:25] (03CR) 10JHathaway: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [17:18:50] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-hd2004 to codfw - jhancock@cumin2002" [17:18:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-hd2004 to codfw - jhancock@cumin2002" [17:18:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [17:19:13] (03PS7) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [17:20:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:21:23] (03PS1) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) [17:23:04] (03CR) 10JHathaway: "Sorry, the CI failure was dependent on merging I5f45ce5994a0298dfe735426e515ab6cf11ce7f4 to update the CI image to bullseye. The CI image " [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [17:23:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:25:06] (03PS1) 10Dzahn: devtools/hiera: replace legacy facts for puppet 8 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1074491 [17:25:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10164652 (10phaultfinder) [17:25:46] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:29:19] (03PS2) 10Dzahn: devtools/hiera: replace legacy facts for puppet 8 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1074491 [17:29:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-hd2005 to codfw - jhancock@cumin2002" [17:29:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-hd2005 to codfw - jhancock@cumin2002" [17:29:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:31:12] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:31:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:32:23] (03PS1) 10Dzahn: gitlab: replace legacy Hiera facts with newer syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074493 [17:34:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:36:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logging-hd2004 [17:37:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-hd2004 [17:37:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logging-hd2005 [17:37:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-hd2005 [17:41:27] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:41:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2004.codfw.wmnet with OS bookworm [17:46:07] (03CR) 10Dzahn: [C:04-2] "gotta be careful with this, as this is what adds those service IPs, dont want conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/1074477 
(https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:46:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2005.codfw.wmnet with OS bookworm [17:47:58] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10164743 (10Jhancock.wm) [17:52:22] (03PS1) 10Dzahn: gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) [17:52:41] (03CR) 10CI reject: [V:04-1] gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:54:58] (03PS2) 10Dzahn: gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) [17:55:19] (03CR) 10CI reject: [V:04-1] gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:58:23] (03PS3) 10Dzahn: gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) [17:58:41] (03CR) 10CI reject: [V:04-1] gerrit: add acme_chief snippet to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:59:16] why still picky, jerkins [18:11:20] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [18:11:30] (03PS4) 10Dzahn: gerrit: add acme_chief to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) [18:12:08] (03CR) 10JHathaway: [C:03+2] ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 
(https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [18:12:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2004.codfw.wmnet with reason: host reimage [18:13:01] (03CR) 10JHathaway: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [18:13:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [18:13:50] (03CR) 10CI reject: [V:04-1] gerrit: add acme_chief to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:16:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2004.codfw.wmnet with reason: host reimage [18:16:30] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [18:18:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2005.codfw.wmnet with reason: host reimage [18:18:22] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [18:19:06] (03PS5) 10Dzahn: gerrit: add acme_chief to gerrit-setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) [18:22:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2005.codfw.wmnet with reason: host reimage [18:23:13] (03CR) 10Dzahn: [C:03+2] "testing only on new hardware" [puppet] - 10https://gerrit.wikimedia.org/r/1074498 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:26:41] (03PS3) 10JHathaway: tftpboot: squash puppetserver log warning. 
[puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) [18:26:46] (03PS18) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [18:26:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [18:29:55] (03CR) 10JHathaway: [C:03+2] tftpboot: squash puppetserver log warning. [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [18:31:44] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [18:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 20.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:35:44] (03Abandoned) 10JHathaway: WIP - ci test [puppet] - 10https://gerrit.wikimedia.org/r/1074486 (owner: 10JHathaway) [18:35:48] (03Abandoned) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1074482 (owner: 10JHathaway) [18:36:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:38:02] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [18:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:42:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:44:26] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10164858 (10Dzahn) gerrit2003 now has a working apache-based gerrit::proxy with certs, no puppet errors and everything. except the actual gerrit application and we avoided ad... [18:57:50] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T375281#10164873 (10Dzahn) [18:59:01] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T375281#10164877 (10Dzahn) →14Duplicate dup:03T374897 [18:59:47] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10164875 (10Dzahn) [19:13:54] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 19.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:14:48] (03PS9) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:17:42] 06SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - 
https://phabricator.wikimedia.org/T356790#10164918 (10Eevans) [19:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:24:59] (03PS2) 10Ebernhardson: [WIP] cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) [19:28:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:28:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:28:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2005.codfw.wmnet with OS bookworm [19:28:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2004.codfw.wmnet with OS bookworm [19:28:39] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10164934 (10Jhancock.wm) [19:29:08] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10164937 (10Jhancock.wm) 05Open→03Resolved @colewhite this is complete! 
[19:42:24] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:46:52] (03CR) 10JHathaway: [C:03+1] gitlab: replace legacy Hiera facts with newer syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [19:47:02] (03CR) 10JHathaway: [C:03+1] devtools/hiera: replace legacy facts for puppet 8 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1074491 (owner: 10Dzahn) [19:52:39] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:52:39] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:01:55] FIRING: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:25] RESOLVED: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:14] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375304 (10phaultfinder) 03NEW [20:25:19] 06SRE-OnFire, 10Incident Tooling: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305 (10Eevans) 03NEW [20:25:40] 06SRE-OnFire, 10Incident Tooling: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10165091 (10Eevans) p:05Triage→03Medium [20:30:19] PROBLEM - Hadoop 
NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:35:20] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:20:14] (03PS1) 10Dwisehaupt: frack: remove frlog2001 and frpm2001 for decom [dns] - 10https://gerrit.wikimedia.org/r/1074537 (https://phabricator.wikimedia.org/T375239) [21:23:11] (03PS1) 10Dwisehaupt: icinga: remove frlog2001 and frpm2001 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1074538 (https://phabricator.wikimedia.org/T375239) [21:24:49] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [21:26:13] (03CR) 10Dwisehaupt: "Hosts are both powered off and set with downtime. This can roll at any point." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074538 (https://phabricator.wikimedia.org/T375239) (owner: 10Dwisehaupt) [21:28:29] (03CR) 10Jgreen: [C:03+1] frack: remove frlog2001 and frpm2001 for decom [dns] - 10https://gerrit.wikimedia.org/r/1074537 (https://phabricator.wikimedia.org/T375239) (owner: 10Dwisehaupt) [21:28:57] (03CR) 10Jgreen: [C:03+1] icinga: remove frlog2001 and frpm2001 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1074538 (https://phabricator.wikimedia.org/T375239) (owner: 10Dwisehaupt) [21:30:07] (03CR) 10Dwisehaupt: [C:03+2] frack: remove frlog2001 and frpm2001 for decom [dns] - 10https://gerrit.wikimedia.org/r/1074537 (https://phabricator.wikimedia.org/T375239) (owner: 10Dwisehaupt) [21:31:09] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommission frlog2001 and frpm2001 - dwisehaupt@cumin1002" [21:31:47] !log dwisehaupt@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommission frlog2001 and frpm2001 - dwisehaupt@cumin1002" [21:31:47] !log dwisehaupt@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:32:51] (03PS1) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [21:33:23] i aborted that cookbook run since there were some hiera changes that weren't mine after the host decommissioning dns bits. 
[21:33:51] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:35:33] (03PS2) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [21:35:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:36:18] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: Decommission frack hosts: frpm2001 - https://phabricator.wikimedia.org/T375297#10165162 (10Dwisehaupt) a:05Dwisehaupt→03None [21:37:21] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: Decommission frack hosts: frlog2001 - https://phabricator.wikimedia.org/T375239#10165166 (10Dwisehaupt) a:05Dwisehaupt→03None [21:41:18] (03PS3) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [21:41:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:44:05] (03PS4) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [21:44:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:55:23] (03PS5) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [21:55:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [22:00:08] 10ops-eqiad, 06SRE, 
06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165191 (10phaultfinder) [22:40:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10165254 (10Dwisehaupt) Can we get verification on the status of these hosts? Are they racked, cabled, and ready for build out? [22:41:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10165255 (10Dwisehaupt) Can we get verification on the status of this host? Are they racked, cabled, and ready for build out? [22:41:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10165256 (10Dwisehaupt) Can we get verification on the status of this host? Are they racked, cabled, and ready for build out? [22:42:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10165258 (10Dwisehaupt) Can we get verification on the status of these hosts? Are they racked, cabled, and ready for buildout? 
[22:54:51] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375304#10165263 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate [22:59:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:59:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:01:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:01:34] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:02:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1040.eqiad.wmnet with OS bookworm [23:02:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10165273 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm [23:25:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10165299 (10Jclark-ctr) @MoritzMuehlenhoff can you update puppet site.pp is missing these servers. also please verify preseed.yaml is updated Thanks [23:26:45] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10165301 (10Jclark-ctr) ` Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md1 : active raid1 sda2[2] sdb2[1] 185469952 blocks super 1.2 [2/2] [UU] bitm... 
[23:27:22] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10165302 (10Jclark-ctr) 05Open→03Resolved [23:27:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10165303 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [23:29:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10165304 (10Jclark-ctr) @Andrew i see this ticket is in my name. is there something i need to do for this? [23:30:57] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10165306 (10Jclark-ctr) @dcaro did you have an update with what servers and drives I can send? I will reach out o... [23:38:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bullseye [23:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074548 [23:46:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10165309 (10Jclark-ctr) {F57526622} I got this failure and will not go past. @VRiley-WMF have you gotten anywhere with dell? [23:56:54] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309 (10Eevans) 03NEW [23:58:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10165340 (10Jclark-ctr) If sdd was the drive replaced which assume from dmesg ` [Thu Sep 12 16:24:57 2024] sd 0:0:4:0: [sdd] Attached SCSI disk ` https://wiki... 
[23:59:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T374652#10165342 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate of T362841