[00:01:13] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1082868 [00:01:57] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1082868 (owner: 10Ncmonitor) [00:07:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082866 (owner: 10TrainBranchBot) [00:08:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082871 [00:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082871 (owner: 10TrainBranchBot) [00:12:06] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1082869 (owner: 10Ncmonitor) [00:12:29] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1082870 (owner: 10Ncmonitor) [00:41:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082871 (owner: 10TrainBranchBot) [00:57:05] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/e97f5e2f4cf03d92d8d84315aca46faf41f06ebb2de5e65af9718bb6377c76d0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:15:25] FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:17:05] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:17:29] PROBLEM - Host cloudsw1-c8-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:17:47] RECOVERY - Host cloudsw1-c8-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [01:20:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:25:25] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:30:40] FIRING: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:40] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:42:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:47:25] FIRING: [5x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:17] (03PS4) 10Scott French: httpd: introduce -next track and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) [01:57:25] FIRING: [5x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:12:25] FIRING: [5x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:25] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:22:40] FIRING: [4x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:26:51] (03PS1) 10Pppery: Missing.php: redirect wikisources to localized main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082875 [02:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:40] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:10] FIRING: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:55] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:10] FIRING: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:39] (03PS2) 10Pppery: Missing.php: redirect wikisources to localized main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082875 [02:48:56] (03PS3) 10Pppery: Missing.php: redirect wikisources to localized main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082875 [03:02:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:20] (03PS1) 10Dreamrimmer: Allow admins on testwiki to grant and remove upwizcampeditors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082878 (https://phabricator.wikimedia.org/T378067) [03:13:55] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:57] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:17:47] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:22:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:05] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:28:23] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:28:37] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:07:25] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:02] (03CR) 10Blake Hale: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [04:14:17] (03CR) 10Blake Hale: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [04:14:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:38] (03CR) 10Blake Hale: "Here you go" [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [04:15:38] (03CR) 10Blake Hale: [C:03+1] Change all role contacts for Data Engineering -> Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [04:19:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:25] FIRING: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:25] RESOLVED: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:57] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:31:35] 06SRE, 06DBA, 07Wikimedia-production-error: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076#10261619 (10ABran-WMF) [05:33:02] 06SRE, 06DBA, 07Wikimedia-production-error: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076#10261622 (10ABran-WMF) >>! In T378076#10258297, @Samwalton9-WMF wrote: > https://www.wikimediastatus.net/incidents/b406lmnx5s57 this is indeed a side effect of the t... [05:35:49] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:39:25] FIRING: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:43:21] (03CR) 10Fabfur: liberica: provide a liberica module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [05:43:30] (03CR) 10Fabfur: profile: Provide a liberica profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [05:47:59] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:48:49] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:49:25] RESOLVED: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:25] FIRING: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T0600) [06:02:25] RESOLVED: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:25] FIRING: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:06:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:36] (03CR) 10Slyngshede: [C:03+1] "Looks good, inline nit about wording." [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [06:12:59] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:13:49] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:14:25] RESOLVED: [3x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:37] !log jmm@cumin1002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [06:23:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:27:59] !log jmm@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org [06:29:23] (03CR) 10Ayounsi: [C:03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [06:29:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org [06:34:05] (03CR) 10Muehlenhoff: admin - explicit approval not needed for analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [06:38:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org [06:38:38] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th), 13Patch-For-Review: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10261667 (10MoritzMuehlenhoff) >>! In T370424#10051517, @BTullis wrote: > This sounds sensible to me,... [06:40:38] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th), 13Patch-For-Review: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10261668 (10MoritzMuehlenhoff) What's the rationale for treating non-staff different? Is it intention... [06:47:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [06:47:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261672 (10ops-monitoring-bot) Draining ganeti2014.codfw.wmnet of running VMs [06:47:35] PROBLEM - ganeti-confd running on ganeti2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [06:48:29] PROBLEM - ganeti-noded running on ganeti2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [06:51:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [06:51:30] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:32] ^ expected, ganeti2013 is being removed from active service [06:53:58] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2013 from active Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082795 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [06:54:25] (03PS3) 10Muehlenhoff: Cover one more case in the setup of Envoy firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1082806 [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T0700) [07:13:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:59] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:17:49] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:18:25] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:30] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:23:40] FIRING: [5x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:42:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2004.codfw.wmnet to drbd [07:43:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261702 (10ops-monitoring-bot) VM kubestagemaster2004.codfw.wmnet switching disk type to drbd [07:48:40] RESOLVED: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:59] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:52:29] (03CR) 10Elukey: "Thanks a lot for kicking this off! I don't particularly love the -next suffix in this case since it may be a moving target, I'd prefer to " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [07:58:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2004.codfw.wmnet to drbd [08:00:57] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:01:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2004.codfw.wmnet to plain [08:01:30] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:01:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261731 (10ops-monitoring-bot) VM kubestagemaster2004.codfw.wmnet switching disk type to plain [08:01:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2004.codfw.wmnet to plain [08:02:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [08:02:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [08:03:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2004.codfw.wmnet to drbd [08:03:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261732 (10ops-monitoring-bot) VM kubestagemaster2004.codfw.wmnet switching disk type to drbd [08:11:35] !log imported openjdk-8 8u422-b05-1~deb12u1 to component/jdk for bookworm-wikimedia [08:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2004.codfw.wmnet to drbd [08:18:52] (03CR) 10Slyngshede: [C:03+1] admin - explicit approval not needed for analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [08:21:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [08:21:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261764 (10ops-monitoring-bot) Draining ganeti2014.codfw.wmnet of running VMs [08:22:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [08:22:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2004.codfw.wmnet to plain [08:22:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:23:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261765 (10ops-monitoring-bot) VM kubestagemaster2004.codfw.wmnet switching disk type to plain [08:24:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2004.codfw.wmnet to plain [08:24:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [08:24:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261766 (10ops-monitoring-bot) Draining ganeti2014.codfw.wmnet of running VMs [08:27:01] !log installing wireshark security updates [08:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:55] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:33:03] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:40:38] 06SRE, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171 (10MoritzMuehlenhoff) 03NEW [08:41:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10261799 (10MoritzMuehlenhoff) [08:42:25] (03PS1) 10Jelto: wikidata-query-gui: bump image version to 2024-10-25-083213 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083135 (https://phabricator.wikimedia.org/T350793) [08:42:54] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2011.codfw.wmnet [08:44:19] (03PS1) 10Btullis: Pause the XML/SQL dumps due to potential data quality issues [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) [08:44:53] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:46:13] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) (owner: 10Btullis) [08:47:35] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:49:03] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:04] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: bump image version to 2024-10-25-083213 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083135 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [08:49:53] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:13] (03Merged) 10jenkins-bot: wikidata-query-gui: bump image version to 2024-10-25-083213 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083135 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [08:51:15] (03PS2) 10Btullis: Pause the XML/SQL dumps due to potential data quality issues [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) [08:52:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4379/co" [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) (owner: 10Btullis) [08:53:03] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:53:05] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [08:53:53] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:29] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:01:27] PROBLEM - SSH on mwmaint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:04:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:04:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2011.codfw.wmnet [09:05:09] 06SRE, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171#10261882 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2011.codfw.wmnet` - ganeti2011.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... [09:06:12] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2012.codfw.wmnet [09:07:17] RECOVERY - SSH on mwmaint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:25] FIRING: [16x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:16:25] FIRING: [16x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:17:01] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:17:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:17:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:17:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2012.codfw.wmnet [09:17:39] 06SRE, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171#10261897 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2012.codfw.wmnet` - ganeti2012.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... [09:21:25] FIRING: [16x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:53] 06SRE, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171#10261900 (10MoritzMuehlenhoff) [09:26:25] FIRING: [16x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:12] (03CR) 10Marco Fossati: [C:03+1] "LGTM, but I don't have permission to +2!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082809 (https://phabricator.wikimedia.org/T377988) (owner: 10Cparle) [09:28:30] (03PS5) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [09:28:48] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171#10261902 (10MoritzMuehlenhoff) [09:35:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082809 (https://phabricator.wikimedia.org/T377988) (owner: 10Cparle) [09:40:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10261945 (10elukey) Tried to upgrade the SAS3809's firmware on ms-be2083 to see if the JBOD disks would be picked up, but no luck (followed https://www.su... [09:41:25] FIRING: [10x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:35] (03PS2) 10EoghanGaffney: apt-staging: Import packages with gitlab-package-puller [puppet] - 10https://gerrit.wikimedia.org/r/1080069 [09:45:16] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10261974 (10elukey) Summary of an IRC chat between me, Jayme and Matthew: * The absence of the BBU... [09:48:13] (03PS1) 10Muehlenhoff: profile::phabricator::migration: Fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/1083148 [09:51:56] (03PS1) 10Muehlenhoff: datahubsearch: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1083149 [09:52:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083149 (owner: 10Muehlenhoff) [09:52:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1082782 (owner: 10Slyngshede) [09:56:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [09:57:00] (03PS1) 10Clément Goubert: Revert "mw-debug: Recreate instead of RollingUpdate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083150 (https://phabricator.wikimedia.org/T374907) [09:58:17] (03PS1) 10Muehlenhoff: Remove ganeti2014 from active Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1083151 (https://phabricator.wikimedia.org/T376594) [09:59:32] (03PS2) 10Alexandros Kosiaris: Revert "mw-debug: Recreate instead of RollingUpdate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083150 (https://phabricator.wikimedia.org/T374907) (owner: 10Clément Goubert) [10:04:47] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert "mw-debug: Recreate instead of RollingUpdate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083150 (https://phabricator.wikimedia.org/T374907) (owner: 10Clément Goubert) [10:05:10] (03PS6) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [10:05:39] (03CR) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [10:05:51] (03Merged) 10jenkins-bot: Revert "mw-debug: Recreate instead of RollingUpdate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083150 (https://phabricator.wikimedia.org/T374907) (owner: 10Clément Goubert) [10:10:21] (03PS3) 10Cathal Mooney: Interface automation templates for pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) [10:12:13] (03PS3) 10EoghanGaffney: apt-staging: Import packages with gitlab-package-puller [puppet] - 10https://gerrit.wikimedia.org/r/1080069 [10:12:34] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:15:35] (03CR) 10Elukey: "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [10:15:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:16:17] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:17:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:18:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:19:08] (03CR) 10Elukey: "Also this on sretest2001 (supermicro)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [10:20:54] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:21:13] this is me --^ [10:21:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:24:24] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [10:31:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'maintenance', diff saved to https://phabricator.wikimedia.org/P70588 and previous config saved to /var/cache/conftool/dbconfig/20241025-103157-arnaudb.json [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T1100). [11:13:09] (03CR) 10Ayounsi: [C:03+1] "I think those configuration option only apply to UEFI boot. So we might be able to only set them up once for the ideal EFI setup. And they" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [11:14:47] (03PS3) 10Sergio Gimeno: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) [11:14:47] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) [11:24:19] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha_WMDE - https://phabricator.wikimedia.org/T378181 (10Deepesha_WMDE) 03NEW [11:29:20] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182 (10Deepesha_WMDE) 03NEW [11:30:30] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10262181 (10Deepesha_WMDE) [11:37:58] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2014 from active Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1083151 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [11:40:40] PROBLEM - ganeti-noded running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:40:48] PROBLEM - ganeti-confd running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:41:30] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:24] (03PS1) 10Muehlenhoff: Deprecate system::role for Hadoop roles [puppet] - 10https://gerrit.wikimedia.org/r/1083157 [11:50:37] (03PS1) 10Muehlenhoff: Deprecate system::role for Swift roles [puppet] - 10https://gerrit.wikimedia.org/r/1083158 [11:54:27] (03PS1) 10Muehlenhoff: Remove obsolete role [puppet] - 10https://gerrit.wikimedia.org/r/1083159 (https://phabricator.wikimedia.org/T359387) [11:58:39] (03PS1) 10Muehlenhoff: Deprecate system::role for memcached/redis roles [puppet] - 10https://gerrit.wikimedia.org/r/1083160 [12:01:25] FIRING: [8x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:48] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10262250 (10darthmon_wmde) [12:03:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10262252 (10darthmon_wmde) hi there, hereby I am backing up this request as @Deepesha_WMDE 's team lead for Wikibase Suite team at WMDE cheers [12:06:25] FIRING: [8x] SystemdUnitFailed: mediawiki_job_campaignevents-aggregateparticipantanswers-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:31] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10262256 (10darthmon_wmde) [12:06:51] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10262261 (10darthmon_wmde) 05Duplicate→03Open [12:07:17] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10262262 (10darthmon_wmde) hi there, hereby I am backing up this request as @Deepesha_WMDE 's team lead for Wikibase Suite team at WMDE cheers [12:08:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10262254 (10darthmon_wmde) →14Duplicate dup:03T378182 [12:16:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10262269 (10MoritzMuehlenhoff) [12:21:30] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:01] (03PS1) 10Muehlenhoff: ganeti-test: Enable puppet-managed /var/lib/ganeti/known_hosts for the role [puppet] - 10https://gerrit.wikimedia.org/r/1083165 (https://phabricator.wikimedia.org/T309724) [12:24:26] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [12:36:36] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:37:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:37:04] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:39] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10262291 (10MBH) @Dzahn ruwiki's oversighters are @DR @Leloiandudu @Q-bit-array and maybe @Tatewaki (I'm not sure this is his account, but he has this username in wiki). You ca... [13:03:33] (03CR) 10Elukey: "Perfect! Going to wait Riccardo for his final validation and then I'll merge!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [13:06:01] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th), 13Patch-For-Review: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10262307 (10Ottomata) > Mid-term the approval management will move to Bitu/idm.wikimedia.org COOL! >... [13:08:22] (03PS1) 10Daimona Eaytoy: Enable CampaignEvents collaboration list by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083170 (https://phabricator.wikimedia.org/T375141) [13:08:38] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:04] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:06] (03PS2) 10Daimona Eaytoy: beta: Drop $wgCampaignEventsShowEventInvitationSpecialPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077370 (https://phabricator.wikimedia.org/T373442) [13:10:05] (03PS2) 10Daimona Eaytoy: prod: Drop $wgCampaignEventsShowEventInvitationSpecialPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077371 (https://phabricator.wikimedia.org/T373442) [13:11:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083170 (https://phabricator.wikimedia.org/T375141) (owner: 10Daimona Eaytoy) [13:11:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077370 (https://phabricator.wikimedia.org/T373442) (owner: 10Daimona Eaytoy) [13:11:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077371 (https://phabricator.wikimedia.org/T373442) (owner: 10Daimona Eaytoy) [13:15:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:15:04] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:15:38] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:38] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:29:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:29:04] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:50:04] (03CR) 10Clément Goubert: [C:04-1] "In the interest of not breaking deployment-prep/beta, we're keeping the code around for some (currently undecided and undefined) time." [puppet] - 10https://gerrit.wikimedia.org/r/1083159 (https://phabricator.wikimedia.org/T359387) (owner: 10Muehlenhoff) [14:00:35] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1083157 (owner: 10Muehlenhoff) [14:02:25] (03CR) 10Ssingh: [C:03+2] tox.ini: add Python 3.11 to interpreters (and remove 3.7) [dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [14:02:55] !log running authdns-update for CR 1082548 [14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:58] 06SRE, 06Data-Platform-SRE: Rate Limited for data science project - https://phabricator.wikimedia.org/T378184#10262566 (10ssingh) [14:07:44] 06SRE, 06Data-Platform-SRE: Rate Limited for data science project - https://phabricator.wikimedia.org/T378184#10262563 (10ssingh) p:05Triage→03Medium [14:09:09] 06SRE, 06Data-Platform-SRE: Rate Limited for data science project - https://phabricator.wikimedia.org/T378184#10262568 (10ssingh) Hi: For some more context, Matt reached out to us on IRC (`#wikimedia-analytics`) and I asked them to file a task here. [14:10:09] 10ops-eqiad, 06DC-Ops: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T378185 (10Jclark-ctr) 03NEW [14:13:38] (03CR) 10Xcollazo: [C:03+1] Pause the XML/SQL dumps due to potential data quality issues [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) (owner: 10Btullis) [14:13:56] 10ops-eqiad, 06DC-Ops: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T378185#10262588 (10Jclark-ctr) [14:18:21] (03CR) 10Btullis: [V:03+1 C:03+2] Pause the XML/SQL dumps due to potential data quality issues [puppet] - 10https://gerrit.wikimedia.org/r/1083136 (https://phabricator.wikimedia.org/T377594) (owner: 10Btullis) [14:18:34] (03CR) 10Ayounsi: "nice! hard to mentally parse it all, but no red flags and the overall logic lgtm! Does it run as noop?" [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [14:18:39] (03CR) 10Ayounsi: [C:03+1] Interface automation templates for pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [14:31:27] !log alert1002: manually killed stunnel4 process to clear puppet failure T375143 [14:31:30] (03PS15) 10Clément Goubert: Provide conftool data for mwcron and mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) [14:31:30] (03CR) 10Clément Goubert: "PCC fails for beta/deployment prep but it was already broken before the change. I think this is the right approach though." [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:05] T375143: Sirenbot keeps joining and parting channels after failing over to alert1002 - https://phabricator.wikimedia.org/T375143 [14:32:58] (03CR) 10Ssingh: "CI is failing, I think we should look why and update it." [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [14:35:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker12[43-58] - https://phabricator.wikimedia.org/T378185#10262648 (10Jclark-ctr) [14:36:03] (03CR) 10Cathal Mooney: "There were a few changes to interface descriptions was all, manual ones slightly differed from our convention. I pushed all those changes" [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [14:37:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:01] 06SRE, 10LDAP-Access-Requests: Access issue for golson-wmf - https://phabricator.wikimedia.org/T378187 (10Tsevener) 03NEW [14:38:45] (03PS1) 10Bking: kafka-stretch2001: grant access to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1083187 (https://phabricator.wikimedia.org/T376813) [14:39:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083187 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [14:43:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1083187 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [14:44:09] (03CR) 10Btullis: [C:03+1] kafka-stretch2001: grant access to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1083187 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [14:47:18] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) [14:48:39] (03PS4) 10Sergio Gimeno: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) [14:49:54] (03CR) 10Sergio Gimeno: [C:04-1] "Scheduled for October 30." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [14:50:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [14:55:31] (03CR) 10Ayounsi: "I guess it's ready for a live test? At first to make sure that it doesn't impact non UEFI devices, and then it can be fine tuned if any is" [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [14:57:43] (03CR) 10Ayounsi: [C:03+1] Add additional ignore line to Juniper warnings for Homer [puppet] - 10https://gerrit.wikimedia.org/r/1082728 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [15:00:38] (03PS32) 10Fabfur: haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [15:01:48] (03PS4) 10Cathal Mooney: Interface automation templates for pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) [15:02:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:18] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2011/ganeti2012 - https://phabricator.wikimedia.org/T378171#10262757 (10Jhancock.wm) 05Open→03Resolved [15:03:30] !log dancy@deploy2002 Installing scap version "4.118.0" for 209 hosts [15:04:45] (03CR) 10Cathal Mooney: Interface automation templates for pfw devices (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [15:05:19] (03CR) 10Bking: [C:03+2] kafka-stretch2001: grant access to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1083187 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [15:07:41] !log dancy@deploy2002 Installation of scap version "4.118.0" completed for 209 hosts [15:09:35] (03CR) 10Scott French: [C:03+1] "Thanks, claime!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [15:19:38] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: wmf sink: extend it to support IPv6 - https://phabricator.wikimedia.org/T378192 (10aborrero) 03NEW [15:29:53] (03PS33) 10Fabfur: hapxykafka: start working on haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [15:29:53] (03PS1) 10Fabfur: haproxykafka: adding profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083191 (https://phabricator.wikimedia.org/T374128) [15:31:04] (03PS34) 10Fabfur: haproxykafka: start working on haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [15:31:04] (03PS2) 10Fabfur: haproxykafka: adding profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083191 (https://phabricator.wikimedia.org/T374128) [15:31:47] (03CR) 10Ssingh: "Mostly nits and some questions and suggestions." [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:36:38] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10262855 (10hnowlan) 05Open→03Stalled [15:37:57] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10262853 (10hnowlan) This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks! [15:40:24] (03PS3) 10Fabfur: haproxykafka: adding profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083191 (https://phabricator.wikimedia.org/T374128) [15:40:49] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10262857 (10hnowlan) 05Open→03Invalid closing as dupe, following up in T378181 [15:47:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:50:57] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10262881 (10hnowlan) This access requires signing an NDA, adding @KFrancis as per access request documentation. Thanks! [15:51:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding backup2012 to codfw - jhancock@cumin2002" [15:51:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding backup2012 to codfw - jhancock@cumin2002" [15:51:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:05] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host backup2012 [15:52:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup2012 [15:55:40] 06SRE-OnFire, 10Sustainability (Incident Followup): create a place (whiteboard) where SRE advertises current site status / things for awareness - https://phabricator.wikimedia.org/T378038#10262899 (10hnowlan) [15:56:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:59:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:01:20] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10262931 (10Eevans) [16:06:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:10] (03PS1) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [16:07:11] (03PS1) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [16:07:57] (03Abandoned) 10Fabfur: haproxykafka: adding profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083191 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [16:09:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [16:19:56] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:42] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:21:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:17] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10262990 (10Ahoelzl) Approved. [16:28:15] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10263004 (10xcollazo) 05Stalled→03Open [16:28:22] !log T378170 Ran mwscript-k8s extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=trwiki --logwiki=metawiki 'Peter.kerepesi' 'Peakbagger77' @ 11:57:19 UTC [16:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] T378170: Unblock stuck global rename of Peter.kerepesi - https://phabricator.wikimedia.org/T378170 [16:32:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:35:18] (03CR) 10Urbanecm: [C:03+1] "LGTM, once the corresponding GE patch lands." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [16:40:31] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10263049 (10Jhancock.wm) [16:42:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2012.codfw.wmnet with OS bookworm [16:42:42] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10263058 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2012.codfw.wmnet with OS bookworm [16:42:48] (03PS1) 10Hnowlan: services_proxy: add tcp_keepalive parameter, enable for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) [16:44:46] (03CR) 10Hnowlan: [C:03+1] shellbox: pin all instances at live image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082317 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:45:40] (03CR) 10Hnowlan: [C:03+1] shellbox-syntaxhighlight: upgrade to 2024-10-15-214239 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082318 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:45:59] (03CR) 10Hnowlan: [C:03+1] shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:46:23] (03CR) 10JHathaway: "> I guess it's ready for a live test? At first to make sure that it doesn't impact non UEFI devices, and then it can be fine tuned if any " [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [16:47:17] (03PS2) 10Hnowlan: services_proxy: add tcp_keepalive parameter, enable for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) [16:49:13] (03PS3) 10Hnowlan: services_proxy: add tcp_keepalive parameter, enable for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) [16:50:21] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4382/co" [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:59:02] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378181#10263085 (10Aklapper) →14Duplicate dup:03T378182 [16:59:06] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10263086 (10Aklapper) [17:00:58] (03CR) 10Ssingh: "Looks good! Just a few questions and nits:" [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:18:47] (03CR) 10Scott French: "Thanks for flagging the preexisting PCC breakage." [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [17:36:40] (03CR) 10Hnowlan: "I was kinda mulling this over myself. I'm not really sure - there's a (hopefully very slim) chance something will go wrong and we'll have " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082731 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:38:22] (03PS10) 10Hnowlan: services_proxy: add tcp_keepalive parameter, enable for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) [17:38:52] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4389/co" [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [17:39:54] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T378201 (10phaultfinder) 03NEW [17:50:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2012.codfw.wmnet with reason: host reimage [17:54:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2012.codfw.wmnet with reason: host reimage [18:18:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:25:10] (03CR) 10Eevans: [C:03+1] "Ack. It could also be re-added later if needed; Let's go ahead and remove it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082731 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [18:28:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:28:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2012.codfw.wmnet with OS bookworm [18:29:02] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10263484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2012.codfw.wmnet with OS bookworm com... [18:29:17] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10263487 (10Jhancock.wm) [18:32:24] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10263492 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @jcrespo hey sorry about that. got this one conflated with another order. it's rea... [18:33:33] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10263500 (10KFrancis) >>! In T378082#10262852, @hnowlan wrote: > This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks... [18:50:28] (03CR) 10Scott French: [C:03+1] "Looks good! Actually, this *is* the correct solution vs. trying to change the curl client behavior - i.e., this actually touches the path " [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [18:52:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10263534 (10KFrancis) Please provide Deepesha Burse's email address and I will process the NDA. Thanks! [20:06:25] FIRING: [4x] SystemdUnitFailed: mediawiki_job_cirrus_build_completion_indices_codfw.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:32:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156438880 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:33:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6925992 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring