[00:01:12] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [00:02:24] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:03:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:03:48] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1042 [00:04:24] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1051.eqiad.wmnet with reason: host reimage [00:05:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1042 [00:05:22] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:08:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1051.eqiad.wmnet with reason: host reimage [00:08:53] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1041 [00:08:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1041 [00:08:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043 [00:10:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host ganeti1043 [00:10:20] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1044 [00:10:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1075087 (owner: 10TrainBranchBot) [00:11:05] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host db1246.eqiad.wmnet [00:11:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1044 [00:14:37] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:15:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1042.eqiad.wmnet with OS bookworm [00:15:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169766 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1042.eqiad.wmnet with OS bookworm [00:20:36] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [00:20:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm [00:21:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1043.eqiad.wmnet with OS bookworm [00:21:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
ganeti1043.eqiad.wmnet with OS bookworm [00:22:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:23:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:23:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1051.eqiad.wmnet with OS bookworm [00:23:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1051.eqiad.wmnet with OS bookworm completed:... [00:28:45] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:29:23] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1042.eqiad.wmnet with reason: host reimage [00:32:36] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:32:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1042.eqiad.wmnet with reason: host reimage [00:35:36] !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [00:39:37] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [00:40:18] !log jclark@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on ganeti1043.eqiad.wmnet with reason: host reimage [00:40:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:41:32] FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:43:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1043.eqiad.wmnet with reason: host reimage [00:46:57] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:52:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:52:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1042.eqiad.wmnet with OS bookworm [00:52:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:52:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1042.eqiad.wmnet with OS bookworm completed:... 
[00:52:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:54:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:54:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:56:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169829 (10Jclark-ctr) [00:57:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169834 (10Jclark-ctr) [00:58:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:58:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:58:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1043.eqiad.wmnet with OS bookworm [00:58:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169837 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host 
ganeti1043.eqiad.wmnet with OS bookworm completed:...
[00:59:18] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169838 (Jclark-ctr)
[01:00:12] ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T375455 (phaultfinder) NEW
[01:00:23] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169845 (Jclark-ctr)
[01:00:40] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm
[01:00:46] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**) - Removed fr...
[01:03:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:07:41] (PS1) Papaul: Remove db1246 from using partman/custom/db.cfg end of testing [puppet] - https://gerrit.wikimedia.org/r/1075092 (https://phabricator.wikimedia.org/T374215)
[01:08:02] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643)
[01:08:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[01:08:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:08:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:08:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1052.eqiad.wmnet with OS bookworm
[01:08:57] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm completed:...
[01:14:16] ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10169864 (Jclark-ctr) a:VRiley-WMF
[01:14:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[01:15:16] ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T375455#10169867 (Jclark-ctr) Open→Resolved a:Jclark-ctr Corrected manual ip on new supermicro server
[01:15:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:28] (CR) Ssingh: "Sorry it took a while for the review." [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[01:15:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169871 (Jclark-ctr) Open→Resolved a:Jclark-ctr Relocated 3 servers that had not been imaged to new rack
[01:16:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:17:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.541 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:31] (CR) Ssingh: "modules/varnish/templates/browsersec.body.html.erb also should be updated with the text." [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[01:29:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:33:39] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:34:12] (CR) Papaul: [C:+2] Remove db1246 from using partman/custom/db.cfg end of testing [puppet] - https://gerrit.wikimedia.org/r/1075092 (https://phabricator.wikimedia.org/T374215) (owner: Papaul)
[01:35:32] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[01:42:28] ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169909 (Papaul) @ABran-WMF - For step 1 testing I used sudo cookbook sre.hosts-dhcp --os bullseye db1246 I destroyed the raid10 configuration and set it back up...
[02:03:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:48] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10169921 (10Papaul) 05Open→03Resolved This is complete, thanks @Dwisehaupt @Jhancock.wm [02:33:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169923 (10Papaul) [02:35:07] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169924 (10Papaul) [02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:04] PROBLEM - mysqld processes #page on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [02:40:06] PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [02:40:11] PROBLEM - MariaDB Replica IO: s2 #page on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:40:11] PROBLEM - MariaDB Replica Lag: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:40:12] 
PROBLEM - MariaDB Replica SQL: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:46:09] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169925 (10Papaul) [02:59:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:29] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643) [03:01:30] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [03:02:16] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [03:02:34] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 [03:02:38] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [03:05:26] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:30] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server depooled. 
Has hardware issues
[03:06:44] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server depooled. Has hardware issues
[03:06:55] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169930 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a261b5e0-f6ea-4087-9a5c-74f99c8cbc7e) set by eevans@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their servi...
[03:14:44] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:24:44] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:41:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10169937 (jwang) Resolved→Open @ssingh, My manager (@mpopov ) told me I should request for `airflow-analytics-product-admins` group, instead of `deployment...
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0400) [04:04:56] FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:04] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.21 (duration: 06m 02s) [04:09:56] FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:56] FIRING: [4x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:00] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:29:06] (03PS1) 10RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1075098 [04:40:55] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:41:32] FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node 
Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:03:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance [05:03:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance [05:10:37] (03CR) 10Ebrahim: "11 days and no response, is it possible for you to have a look at the namespaces also and to list issues you see so I can at least go for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [05:13:17] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10169957 (10ABran-WMF) >>! In T375382#10168776, @VRiley-WMF wrote: > Hi! We do have a spare DIMM (32 gig, 2666mts) that we can swap at anytime for this unit. Please let us know when is the best... 
[05:39:32] (CR) Arnaudb: [C:+1] pc2017: Set it to master [puppet] - https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) (owner: Ladsgroup)
[05:49:32] (Abandoned) Arnaudb: mariadb: productionize db2223 [puppet] - https://gerrit.wikimedia.org/r/1071570 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[05:55:18] !log centrallog1002 upgrade to bookworm in progress https://phabricator.wikimedia.org/T353912
[05:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:30] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459 (phaultfinder) NEW
[05:58:16] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10169971 (ABran-WMF) >>! In T375257#10168468, @wiki_willy wrote: > Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to...
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0600)
[06:00:05] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0600).
[06:02:27] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459#10169972 (10phaultfinder) [06:03:40] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:28:10] (03PS1) 10Arnaudb: mariadb: productionize db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1075108 (https://phabricator.wikimedia.org/T373579) [06:31:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:42] !log tappof@cumin2002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [06:38:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:38:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:38:28] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:39:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10170007 (10ABran-WMF) Amazing! @Papaul thanks for the help! [06:43:13] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:45:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [06:55:43] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:56:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0700). [07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] o/ [07:00:15] I can deploy [07:01:20] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:01:24] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 719, down: 13, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:01:28] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:02:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [07:02:13] !log tappof@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [07:02:50] (03Merged) 10jenkins-bot: Add a private variant of the cirrus update stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [07:03:20] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]] [07:03:24] T374335: The SUP producer should ship private wiki update events to a separate stream - https://phabricator.wikimedia.org/T374335 [07:07:07] !log dcausse@deploy1003 dcausse, ebernhardson: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:12:52] !log dcausse@deploy1003 dcausse, ebernhardson: Continuing with sync [07:17:32] (03CR) 10Vgutierrez: [C:04-1] varnish: Occasional RSA cert connection warnings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [07:23:28] RECOVERY - BGP status on 
cr1-eqiad is OK: BGP OK - up: 674, down: 9, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:23:44] (03CR) 10Giuseppe Lavagetto: [C:04-1] "Overall very nice job, just one correction." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [07:27:31] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]] (duration: 24m 11s) [07:27:35] T374335: The SUP producer should ship private wiki update events to a separate stream - https://phabricator.wikimedia.org/T374335 [07:28:14] (03PS1) 10Slyngshede: Dummy Gitlab tokens for IDM. [labs/private] - 10https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820) [07:28:54] !log closing the backport window [07:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:40] (03PS1) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [07:33:04] (03PS2) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [07:34:27] (03PS3) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [07:36:24] (03PS4) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [07:37:13] (03CR) 10Vgutierrez: [C:03+2] "approved by @aotto@wikimedia.org on phab task" [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060) (owner: 10Vgutierrez) [07:38:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - 
https://phabricator.wikimedia.org/T375060#10170093 (10Vgutierrez) 05Stalled→03In progress [07:39:26] (03PS5) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [07:41:45] !log reboot cr3-ulsfo - T375345 [07:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:49] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345 [07:42:44] (03CR) 10Hashar: Check that throttling exceptions use valid public IP addresses (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [07:44:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:22] PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:46] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:46:29] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10170101 (10ayounsi) From JTAC: > [...] after engaging further resources we have been requested to attempt a full chassis reboot and check if the issue persists before proceeding with the... [07:47:10] (03CR) 10Slyngshede: [V:03+2 C:03+2] Dummy Gitlab tokens for IDM. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [07:47:18] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:47:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:47:46] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:50] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 9, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:49:24] RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.60 ms [07:49:28] (03PS1) 10Slyngshede: C:idm Add gitlab configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) [07:50:19] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10170102 (10Vgutierrez) 05In progress→03Resolved ` vgutierrez@krb1001:~$ sudo manage_principals.py create cyndywikime --email_address=csimi... 
[07:50:33] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4094/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [07:51:16] (03CR) 10Muehlenhoff: "Looks good, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [07:52:28] (03PS2) 10Slyngshede: C:idm Add gitlab configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) [08:03:59] (03CR) 10Muehlenhoff: "Also looks good, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [08:06:13] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10170125 (10dcaro) [08:07:14] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10170130 (10dcaro) [08:11:35] (03Abandoned) 10Hashar: do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857 (owner: 10Jbond) [08:14:56] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:37] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [08:25:29] (03PS5) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [08:28:32] (03PS6) 10Slyngshede: Account blocking 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [08:29:15] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:29:45] (03CR) 10Volans: deployment_server: mwscript_k8s usability tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus) [08:29:58] (03CR) 10FNegri: [C:03+1] "Thanks Milimetric for the review. I'm gonna try to get a +1 from the Data Persistence team as well, then I will merge and apply." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [08:30:57] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 [08:31:02] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [08:31:06] train prep failed last night, I'm re-running it [08:34:23] (03PS7) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [08:36:11] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [08:36:16] T375223: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223 [08:36:25] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [08:36:38] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetmaster2001.codfw.wmnet with reason: WIP - working on puppet runs [08:36:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetmaster2001.codfw.wmnet with reason: WIP - 
working on puppet runs [08:37:17] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [08:37:45] (03CR) 10Slyngshede: Account blocking (0311 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [08:38:44] (03CR) 10Slyngshede: "See comment regarding moving blocking logic to bitu-ldap. It feels like the correct move, but I'd like to do that in a second step." [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [08:41:10] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:41:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:43:30] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 [08:43:34] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [08:46:13] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [08:49:46] (03CR) 10Muehlenhoff: Account blocking (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [08:51:35] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage [08:53:45] RESOLVED: WidespreadPuppetFailure: 
Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:55:41] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage [08:56:20] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10170311 (10Jelto) I reviewed the [throttling in the past 7 days](https://grafana.wikimedia... [08:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:56:42] (03PS2) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788) [08:57:32] (03PS4) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) [08:59:51] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage [09:03:10] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage [09:03:34] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2426.codfw.wmnet [09:04:12] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2426.codfw.wmnet [09:04:36] PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster1001 is CRITICAL: HTTP 
CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:04:45] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2427.codfw.wmnet [09:05:00] (03PS2) 10Hnowlan: mediawiki: remove check_mw_versions [puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860) [09:05:19] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2427.codfw.wmnet [09:05:47] (03PS5) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) [09:07:33] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2426.codfw.wmnet with reason: reimage [09:07:46] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2426.codfw.wmnet with reason: reimage [09:07:53] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2427.codfw.wmnet with reason: reimage [09:08:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2427.codfw.wmnet with reason: reimage [09:09:22] 10SRE-tools, 06Infrastructure-Foundations: debmonitor could provide users with cumin and/or debdeploy pre-made config/command - https://phabricator.wikimedia.org/T375475 (10fgiunchedi) 03NEW [09:10:03] (03CR) 10Btullis: [C:03+1] hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [09:10:28] (03CR) 10Hnowlan: [C:03+2] "Good catch, removed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860) (owner: 10Hnowlan) [09:11:26] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1177.eqiad.wmnet with OS bullseye [09:11:47] FIRING: HelmReleaseBadStatus: Helm release mw-web/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:14:24] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:47] jnuche: tell me what is you scap status [09:15:33] effie: still running, I'll give an update once it's finished [09:15:45] while I depooled and cordoned and mark both hosts as unschedulable, I may still have stepped on your toes [09:17:04] the deployment is rolling back, but this is a problem that already happened last night [09:17:24] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:17:24] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:17:25] the deployments to K8s have become much slower and are timing out [09:17:40] I see, is there a task? 
we should probably look into it [09:18:17] I haven't created a task yet, I think the issue may be related to https://phabricator.wikimedia.org/T366778 [09:18:56] (03CR) 10Ayounsi: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [09:19:11] (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [09:19:13] (03PS6) 10Ayounsi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 [09:19:20] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [09:19:37] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1176.eqiad.wmnet with OS bullseye [09:20:10] (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [09:21:32] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10170362 (10dcaro) Okok, let's take the 8 drives from cloudcephosd1025 on rack E4 to send them, let me drain it firs... 
[09:21:46] !log jnuche@deploy1003 scap failed: local variable 'e' referenced before assignment (scap version: 4.104.0-1) (duration: 38m 15s) [09:21:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-web/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:22:10] effie: rolled back completed, I'm creating the task [09:22:13] (03CR) 10Muehlenhoff: "Looks good. One final inline. One other thing we should consider in a separate follwup is notifications: We should probably send a notific" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [09:23:22] (03PS2) 10Slyngshede: Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) [09:24:22] (03PS8) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [09:25:09] (03CR) 10Slyngshede: Account blocking (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [09:25:52] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [09:25:52] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=93) for the switch from eqiad to codfw [09:26:50] (03PS7) 10Filippo Giunchedi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [09:27:40] arnaudb: what was this run of prepare? 
[09:27:44] !log upgrade mtail on lists* and ncredir* https://phabricator.wikimedia.org/T375085 [09:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:52] yes but on patch sorry [09:28:07] I'm testing w/ jynus on a tmux 1073762 [09:28:48] effie: task https://phabricator.wikimedia.org/T375477 [09:28:55] --no-sal-logging in use from now [09:28:56] sorry [09:29:01] ah ok, no worries [09:29:21] jnuche: tx [09:29:26] just checking, I was slightly worried, but also the cookbook should have enough checks at the start to prevent runs if it was already done [09:29:48] (03PS3) 10Andrew Bogott: Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) [09:31:50] (03CR) 10David Caro: [C:03+2] Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: 10Andrew Bogott) [09:32:07] (03PS8) 10Filippo Giunchedi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [09:33:23] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [09:34:12] (03PS1) 10Cyndywikime: Drop support for the Old Impact Variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) [09:35:46] (03PS2) 10Cyndywikime: Drop support for the old impact variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) [09:36:26] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:36:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: 
Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:13] (03CR) 10JMeybohm: Initial commit of containerd puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:39:17] (03PS1) 10Effie Mouzeli: kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) [09:40:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:26] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:10] (03CR) 10JMeybohm: [C:03+1] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [09:42:25] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [09:43:12] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [09:46:33] (03PS9) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [09:46:38] (03PS1) 10Btullis: partman: Enable use of the second disk for dse-k8s local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075151 (https://phabricator.wikimedia.org/T365283) [09:46:50] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [09:47:18] (03CR) 
10Btullis: [C:03+2] partman: Enable use of the second disk for dse-k8s local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075151 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [09:48:45] (03PS1) 10Muehlenhoff: Remove obsolete rendering certs [puppet] - 10https://gerrit.wikimedia.org/r/1075152 (https://phabricator.wikimedia.org/T357750) [09:48:45] (03CR) 10Giuseppe Lavagetto: git: add replicated_local_repo define (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [09:49:01] (03PS3) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [09:49:01] (03PS3) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [09:49:01] (03PS3) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [09:49:01] (03PS1) 10Giuseppe Lavagetto: service: make legacy function work with puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075153 [09:51:55] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [09:51:56] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2426.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [09:52:23] (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto) [09:52:40] !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2426 to wikikube-worker2126 [09:52:51] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [09:54:09] <_Gerges> Ping [09:54:30] PROBLEM - BGP status on cr1-codfw 
is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:54:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:47] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [09:56:19] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2426 to wikikube-worker2126 - jiji@cumin1002" [09:56:48] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10170568 (10Jelto) [09:56:54] <_Gerges> Hi, someone can take on both tasks T375055 and T375054, I'll be busy this weekend. 
[09:56:55] T375055: Requesting logo change for bjn.wikipedia.org - https://phabricator.wikimedia.org/T375055 [09:56:55] T375054: Requesting logo change for bjn.wikiquote.org - https://phabricator.wikimedia.org/T375054 [09:57:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2426 to wikikube-worker2126 - jiji@cumin1002" [09:57:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:57:15] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2126 [09:57:47] (03PS4) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [09:57:47] (03PS4) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [09:57:47] (03PS4) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1000) [10:00:18] (03PS10) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [10:00:44] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [10:00:56] (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto) [10:01:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2126 [10:02:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from 
mw2426 to wikikube-worker2126 [10:03:28] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2126.codfw.wmnet on all recursors [10:03:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2126.codfw.wmnet on all recursors [10:03:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:21] !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2126.codfw.wmnet [10:05:42] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2126.codfw.wmnet with OS bullseye [10:05:52] !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2126 [10:06:04] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [10:09:23] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [10:09:39] 10SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170593 (10MatthewVernon) I've confirmed that neither production swift cluster contains the object. And as far back as the swift logs go, we've only ever said 404 to requests for... 
[10:10:07] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 [10:10:12] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [10:10:43] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:23] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2126 - jiji@cumin1002" [10:11:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2126 - jiji@cumin1002" [10:11:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:27] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2126.codfw.wmnet 82.0.192.10.in-addr.arpa 2.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:11:30] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2126.codfw.wmnet 82.0.192.10.in-addr.arpa 2.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:11:31] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2126 [10:11:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2126 [10:11:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2126 [10:13:20] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [10:13:44] !log 
stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet [10:14:31] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1176.eqiad.wmnet [10:14:59] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet [10:16:32] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1176.eqiad.wmnet [10:16:46] (03CR) 10Gmodena: [C:03+1] Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [10:17:08] (03PS5) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [10:17:23] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet [10:19:48] (03PS1) 10Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - 10https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) [10:21:07] (03PS2) 10Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - 10https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) [10:22:59] (03PS3) 10Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - 10https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) [10:25:30] !log force deletion of older thanos blocks - T351927 [10:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [10:25:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [10:25:53] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - 10https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [10:26:56] (03CR) 10Btullis: [C:03+1] "This change gets a +1 from me, but I'm also adding hnowlan for review." [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [10:28:15] FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:12] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage [10:30:23] RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:26] (03CR) 10Hnowlan: [C:03+1] "Thanks for the heads-up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [10:32:47] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [10:34:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage [10:36:41] (03CR) 10Btullis: [C:03+2] hieradata::services_proxy::envoy.yaml: fix duplicated port [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [10:40:36] (03CR) 10Gmodena: [C:03+2] dse-k8s-service: add values for dumps2 job. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [10:41:34] (03Merged) 10jenkins-bot: dse-k8s-service: add values for dumps2 job. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [10:43:09] (03PS1) 10Hnowlan: aqs: remove AQSv1 service components [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) [10:44:12] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4095/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: 10Hnowlan) [10:45:44] (03CR) 10Hnowlan: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1075153 (owner: 10Giuseppe Lavagetto) [10:47:38] (03CR) 10Lucas Werkmeister (WMDE): SSO domain shouldn't have a mobile version (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [10:54:56] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:57:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2126.codfw.wmnet with OS bullseye [10:58:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:58:57] (03CR) 10Muehlenhoff: Block User: Add LDAP blocking/unblocking. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:10:56] 10SRE-swift-storage, 10media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170718 (10jcrespo) Following with the usual preference, I would like to do a new upload rather than overwriting (for the end user it should have the same effe... [11:12:45] (03CR) 10FNegri: [C:03+1] [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [11:13:29] 10SRE-swift-storage, 10media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170728 (10jcrespo) p:05Triage→03High [11:13:35] (03PS1) 10Slyngshede: Dummy secrets for IDM account blocking. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1075174 [11:14:07] 10SRE-swift-storage, 10media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170726 (10jcrespo) 05Open→03In progress a:03prabhat [11:16:13] (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [11:17:32] (03PS2) 10Hnowlan: aqs: remove AQSv1 service components [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) [11:19:34] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4096/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: 10Hnowlan) [11:20:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:21:33] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10170740 (10MoritzMuehlenhoff) [11:25:13] (03PS1) 10Btullis: partman: Fix allocation of sdb for dse-k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1075181 (https://phabricator.wikimedia.org/T365283) [11:25:56] (03CR) 10Btullis: [C:03+2] partman: Fix allocation of sdb for dse-k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1075181 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [11:27:05] (03PS1) 10Bartosz Dziewoński: Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) [11:27:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, 
September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński) [11:27:27] (03CR) 10Slyngshede: [V:03+2 C:03+2] Dummy secrets for IDM account blocking. [labs/private] - 10https://gerrit.wikimedia.org/r/1075174 (owner: 10Slyngshede) [11:28:28] PROBLEM - Host mc2038 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:13] expected? [11:30:33] (03PS1) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) [11:30:45] (03PS2) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) [11:30:46] doesn't respond on ipv4 or ipv6 [11:31:20] !log homer cr*codfw* commit 'T372878' [11:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:24] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:31:30] !log homer lsw1-a6-codfw* commit 'T372878' [11:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:50] vgutierrez: I will take a look in a bit, but it is not a problem the host being down [11:31:53] there's a CPU error logged on mc2038 [11:32:03] CPU 1 machine check error detected [11:32:06] in SEL [11:32:09] excellent [11:32:37] under warranty for two more months fortunately [11:33:08] just in time(TM) [11:33:15] moritzm: can you please paste whatever you have in a phab paste? 
I will open a task [11:34:49] effie: sure: https://phabricator.wikimedia.org/P69399 [11:35:10] cheers tx [11:36:09] I tried to have a look at the system itself, but seems the console died along, can't connect [11:36:10] (03PS1) 10Cyndywikime: Remove wgGEUseNewImpactModule config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075196 (https://phabricator.wikimedia.org/T350077) [11:36:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [11:38:27] !log installing systemd bugfix updates from Bookworm point release [11:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:48] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2126.codfw.wmnet [11:38:50] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2126.codfw.wmnet [11:38:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2126.codfw.wmnet [11:39:04] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:39:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:40:52] (03PS2) 10Abijeet Patro: Enable translation settings banner for Test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075189 (https://phabricator.wikimedia.org/T372460) [11:40:56] (03PS1) 10JMeybohm: prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) [11:42:36] (03PS3) 10Slyngshede: C:idm Add configuration for account blocking. 
[puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) [11:42:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [11:44:32] (03PS4) 10Slyngshede: C:idm Add configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) [11:44:45] (03CR) 10Muehlenhoff: C:idm Add configuration for account blocking. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:44:54] (03CR) 10D3r1ck01: [C:03+1] Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński) [11:45:15] (03PS2) 10Btullis: Add an airflow cluster and assign relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) [11:45:15] (03PS2) 10Btullis: Add a presto cluster and assign the relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) [11:45:20] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4098/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:45:23] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:45:26] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:45:34] (03PS2) 10JMeybohm: prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) [11:46:53] (03CR) 
10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4099/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm) [11:46:58] (03PS5) 10Slyngshede: C:idm Add configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) [11:47:09] (03CR) 10Btullis: [C:03+2] Add a presto cluster and assign the relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [11:47:27] (03CR) 10Bartosz Dziewoński: SSO domain shouldn't have a mobile version (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [11:47:31] (03CR) 10Btullis: [C:03+2] Add an airflow cluster and assign relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [11:47:46] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4100/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:48:20] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [11:48:52] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:48:57] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [11:52:52] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:56] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:41] (03CR) 10Slyngshede: [V:03+1] C:idm Add configuration for account blocking. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:57:22] (03PS5) 10Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) [11:58:19] (03CR) 10Ladsgroup: "I honestly prefer a way to set a section and run them manually so I have control over which sections should be done and in which order but" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [11:58:20] (03CR) 10Slyngshede: [C:03+2] Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [11:58:54] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:59:31] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: reimage [11:59:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: reimage [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1200) [12:00:56] (03Merged) 10jenkins-bot: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [12:01:34] !log btullis@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [12:05:13] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2427.codfw.wmnet [12:05:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2427.codfw.wmnet [12:05:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [12:05:44] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - 10https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [12:07:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:34] (03PS1) 10Santiago Faci: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) [12:08:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:46] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:28] !log 
running db-compare on s2, s3 T375186 [12:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:33] T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186 [12:12:48] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:13:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [12:13:19] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:13:26] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:13:41] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10170935 (10JMeybohm) [12:14:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:14:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:15:05] (03PS1) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [12:15:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:16:11] !log akosiaris@deploy1003 
helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:16:17] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:16:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2427.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [12:17:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:17:28] !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2427 to wikikube-worker2127 [12:17:44] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2427 to wikikube-worker2127 [12:19:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:19:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:13] (03CR) 10Bartosz Dziewoński: "This is just an idea, I'm not sure about it, but I hope it's a good one. Thoughts?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [12:21:03] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [12:21:14] !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2427 to wikikube-worker2127 [12:21:23] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:22:12] (03Abandoned) 10Muehlenhoff: puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [12:24:44] (03CR) 10Ayounsi: [C:03+2] Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [12:24:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [12:25:06] (03CR) 10Elukey: [C:03+2] services: update Tegola's Docker image to pick up package upgrades [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [12:25:46] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on mc2038.codfw.wmnet with reason: CPU failure - T375495 [12:25:50] T375495: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495 [12:26:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on mc2038.codfw.wmnet with reason: CPU failure - T375495 [12:27:01] 10ops-codfw, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495 (10jijiki) 03NEW [12:27:38] 10ops-codfw, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10171003 (10jijiki) [12:28:05] !log akosiaris@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-web: apply [12:28:10] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:28:25] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2427 to wikikube-worker2127 - jiji@cumin1002" [12:29:10] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2427 to wikikube-worker2127 - jiji@cumin1002" [12:29:10] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:29:11] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2127 [12:29:13] (03PS2) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [12:29:17] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [12:29:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2127 [12:29:36] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [12:30:02] (03PS3) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [12:30:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2427 to wikikube-worker2127 [12:30:28] (03PS1) 10Slyngshede: Block User: Add LDAP blocking/unblocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 [12:30:46] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2127.codfw.wmnet on all recursors [12:30:49] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [12:30:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2127.codfw.wmnet on all recursors [12:30:50] (03CR) 10Jelto: [C:03+1] "lgtm as far as I can tell. After merging this the previous fix for gitlab-runners I06f578e23689c29be78eb888f1a8bbbf60b249f9 can be reverte" [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm) [12:30:57] (03PS3) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) [12:31:06] (03CR) 10Muehlenhoff: [C:04-1] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [12:31:19] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [12:31:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede) [12:32:06] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [12:32:08] (03Abandoned) 10Slyngshede: Block User: Add LDAP blocking/unblocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [12:32:26] !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2127.codfw.wmnet [12:32:40] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [12:32:48] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2127.codfw.wmnet with OS bullseye [12:32:58] !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2127 [12:33:06] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:33:58] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10171032 (10jijiki) a:03jijiki [12:35:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10171060 (10MoritzMuehlenhoff) [12:35:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10171075 (10MoritzMuehlenhoff) [12:35:58] FIRING: [2x] CertAlmostExpired: Certificate for service cr1-esams.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:36:17] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2127 - jiji@cumin1002" [12:36:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2127 - jiji@cumin1002" [12:36:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:36:21] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache 
wikikube-worker2127.codfw.wmnet 83.0.192.10.in-addr.arpa 3.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:24] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2127.codfw.wmnet 83.0.192.10.in-addr.arpa 3.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:25] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2127 [12:36:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2127 [12:36:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2127 [12:37:47] (03CR) 10Volans: "This is just the official list of all core sections that can be used in cookbooks as they wish. For example it can be used as an argument " [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [12:38:13] FIRING: [3x] JobUnavailable: Reduced availability for job probes/grpc in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:38:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] SSO domain shouldn't have a mobile version (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [12:39:54] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[12:40:29] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:40:58] FIRING: [4x] CertAlmostExpired: Certificate for service cr1-eqiad.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:43:14] RECOVERY - Disk space on thanos-be1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [12:43:14] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [12:43:14] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [12:44:57] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:48:04] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:48:18] (03PS5) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [12:48:18] (03PS5) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [12:48:18] (03PS1) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [12:48:44] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:53:13] FIRING: [5x] JobUnavailable: Reduced availability for job probes/grpc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:33] (03PS1) 10Alexandros Kosiaris: canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477) [12:55:18] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage [12:55:58] FIRING: [10x] CertAlmostExpired: Certificate for service cr1-codfw.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:56:33] (03CR) 10Effie Mouzeli: [C:03+1] canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477) (owner: 10Alexandros Kosiaris) [12:58:04] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [12:58:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage [12:59:33] (03CR) 10Slyngshede: [C:03+2] Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede) [12:59:51] (03CR) 10Slyngshede: [V:03+1 C:03+2] C:idm Add configuration for account blocking. 
[puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1300). [13:00:05] hnowlan and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:26] hi [13:00:28] all yours Lucas :D [13:00:36] ok :3 [13:00:40] o/ [13:01:00] Mine is yet another only-takes-effect-on-jobrunners change that can't be tested on debug [13:01:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Apply videoscaler request limits and wall clock time limits to shellbox-video (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:01:26] ack [13:01:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:01:45] right, let’s try this merging without explicit rebase thing [13:01:50] since gerrit should now autorebase [13:01:58] * Lucas_WMDE sees a lot of unnormalized errors in logspam-watch :/ [13:02:28] (03Merged) 10jenkins-bot: Block User: Add LDAP blocking/unblocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede) [13:02:32] * Lucas_WMDE looks how long CentralAuth CI usually takes [13:02:34] (03Merged) 10jenkins-bot: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:02:36] ten minutes, ok [13:02:42] then let’s not +2 that backport just yet I think [13:02:54] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] [13:03:02] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [13:03:14] RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [13:03:14] RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops [13:03:14] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [13:03:14] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [13:04:31] (03PS1) 10Gmodena: dse-k8s-services: fix values in dump enrichment app. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) [13:04:56] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:01] (03PS1) 10Slyngshede: P:idm Syntax error in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1075227 [13:05:46] Lucas_WMDE: they say that CI has just gotten much faster, btw: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/XQZNOGXOJP62NSNHG24HIMOYWP5CG737/ [13:06:13] (03CR) 10Slyngshede: [C:03+2] P:idm Syntax error in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1075227 (owner: 10Slyngshede) [13:06:25] ah, true [13:06:36] makes sense that selenium was the slowest of the jobs on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1075045 then [13:06:45] (03CR) 10Alexandros Kosiaris: [C:03+2] canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477) (owner: 10Alexandros Kosiaris) [13:06:48] (I don’t think that one is parallelized everywhere yet) [13:07:00] FIRING: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:08:19] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:24] T373517: shellbox-video pods being restarted prematurely - 
https://phabricator.wikimedia.org/T373517 [13:08:43] (03CR) 10Muehlenhoff: [C:03+2] envoy: Add support for passing an array of sets to the firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [13:08:45] Lucas_WMDE: would you mind if we added a simple config change to the end of the window? [13:09:19] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Continuing with sync [13:09:29] ottomata: sure, go ahead [13:10:23] 10SRE-swift-storage, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#10171322 (10fgiunchedi) [13:10:29] Actually, Lucas_WMDE let me know when you are done with the window and I will deploy it. (easier than editing the calendar :p ) [13:11:26] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [13:11:29] ottomata: ok :P [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:00] RESOLVED: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:14] (03CR) 10Muehlenhoff: [C:04-1] "The underlying patch is now merged, but still need to be updated to use firewall_src_sets" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [13:12:36] (03CR) 10Muehlenhoff: [C:04-1] "The underlying patch is now merged, but 
still need to be updated to use firewall_src_sets" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [13:12:40] Lucas_WMDE: did by any chance canary "feel faster"? [13:13:02] no idea, I wasn’t looking very closely at scap tbh [13:13:07] I can scroll up and see how long it took [13:13:32] sync-canaries-k8s was apparently 2m36s [13:13:52] that's faster alright [13:14:10] it was like 4+ previously [13:14:17] nice! [13:14:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10171352 (10Papaul) @ABran-WMF you welcome. [13:14:31] \o/ [13:15:11] (03PS1) 10Slyngshede: P:idm add account managers for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1075230 [13:16:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] SSO domain shouldn't have a mobile version (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2127.codfw.wmnet with OS bullseye [13:20:06] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] (duration: 17m 12s) [13:20:11] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [13:20:27] Thanks Lucas_WMDE! [13:20:33] np! 
[13:20:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:20:56] (03CR) 10Muehlenhoff: [C:03+1] P:idm add account managers for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1075230 (owner: 10Slyngshede) [13:21:09] !log homer lsw1-a6-codfw* commit 'T372878 [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:13] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:21:18] (03Merged) 10jenkins-bot: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:21:39] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]] [13:21:43] T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272 [13:23:14] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [13:23:51] !log homer cr*codfw* commit T372878 [13:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński) [13:25:17] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [13:25:55] !log btullis@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [13:27:02] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:27:07] T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272 [13:27:38] Lucas_WMDE: that config change is currently only testable on the beta cluster [13:27:43] was just about to ask, yeah [13:27:44] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync [13:27:46] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2127.codfw.wmnet [13:27:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2127.codfw.wmnet [13:27:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2127.codfw.wmnet [13:30:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 287, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:04] (2m45s for sync-canaries-k8s this time btw) [13:31:34] (03CR) 10David Caro: "I think this broke puppet runs on toolforge prometheus :/" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [13:32:44] (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [13:33:37] (03PS2) 10EoghanGaffney: lists: Roll out nftables on both list hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073189 [13:34:08] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [13:34:30] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]] (duration: 12m 51s) [13:34:35] T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272 [13:35:02] (03Merged) 10jenkins-bot: Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński) [13:35:17] (03PS2) 10Cyndywikime: Remove wgGEUseNewImpactModule config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075196 (https://phabricator.wikimedia.org/T350077) [13:35:41] (03CR) 10Gmodena: [C:03+2] Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [13:35:54] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]] [13:35:59] T151012: CentralAuth should have its own temporary password handling - https://phabricator.wikimedia.org/T151012 [13:36:16] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073189 (owner: 10EoghanGaffney) [13:36:26] (03Merged) 10jenkins-bot: Declare streams in support of the reconciliation mechanism for Dumps 2.0. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [13:37:19] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 369, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:41] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345] [13:37:45] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345 [13:37:49] (03PS1) 10Ayounsi: gNMI prometheus check: add specific network CA cert [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) [13:37:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345] [13:38:18] hm, one testserver check failed [13:38:31] https://zero.wikipedia.org/ – expected 301, got 503 [13:38:39] and https://login.wikimedia.org/wiki/Special:Log/renameuser – expected 200, got 503 [13:38:44] (*two testserver checks) [13:38:58] let’s see if I can find those in logstash… [13:39:35] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) (owner: 10Ayounsi) [13:40:05] grmbl, can’t find anything in logstash [13:40:32] I guess I’ll retry… [13:40:38] https://login.wikimedia.org/wiki/Special:Log/renameuser works for me on mwdebug2001 at least [13:41:06] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:09] Lucas_WMDE: the CentralAuth change also isn't really testable on mwdebug - unless you requested a password reset on a wiki curretly running wmf.24 before wmf.24 went out [13:41:11] T151012: CentralAuth should have its own temporary 
password handling - https://phabricator.wikimedia.org/T151012 [13:41:18] ah, I see [13:41:35] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync [13:41:38] let’s try our luck then [13:41:50] (retrying the testserver checks worked apparently btw) [13:41:56] (03CR) 10Slyngshede: [C:03+2] P:idm add account managers for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1075230 (owner: 10Slyngshede) [13:42:11] o_O [13:42:20] !log installing krb5 security updates [13:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:34] goddammit now I’ve got a cached redirect from wikitech to foundationwiki [13:42:42] which doesn’t go away even if I turn of WikimediaDebug again [13:43:01] (and I’m lucky that I even know that it’s related to WikimediaDebug at all because I happen to have heard of it somewhere) [13:43:33] yay, disabling cache in the network panel in dev tools fixed it… [13:44:02] (03CR) 10David Caro: "Hmm, is there a new blackbox-exporter version needed for the grpc config? I think it's not understanding it, line 19 of the generated conf" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [13:46:15] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]] (duration: 10m 21s) [13:46:20] T151012: CentralAuth should have its own temporary password handling - https://phabricator.wikimedia.org/T151012 [13:46:28] (0m34s for sync-canaries-k8s this time! 
:o) [13:46:33] (cc akosiaris) [13:46:36] ottomata: all yours [13:47:07] !incidents [13:47:08] 5274 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [13:47:08] 5273 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [13:47:08] 5272 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [13:47:08] 5271 (RESOLVED) db1246 (paged)/mysqld processes (paged) [13:47:08] 5267 (RESOLVED) Host pc1013 (paged) - PING - Packet loss = 100% [13:47:43] Lucas_WMDE: woohoo! [13:47:51] Lucas_WMDE: ty [13:48:00] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [13:48:02] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1073189 (owner: 10EoghanGaffney) [13:49:31] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Roll out nftables on both list hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073189 (owner: 10EoghanGaffney) [13:50:46] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [13:50:46] (03CR) 10Filippo Giunchedi: [C:03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm) [13:51:07] (03CR) 10Ottomata: [C:03+2] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [13:51:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [13:51:26] (03PS4) 10Joal: EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 
(https://phabricator.wikimedia.org/T361498) [13:51:36] (03CR) 10Ottomata: [V:03+2 C:03+2] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [13:51:49] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM from a Prometheus perspective" [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) (owner: 10Ayounsi) [13:53:13] dcaro: ah yes pretty sure it is the blackbox-exporter version mismatch [13:53:30] dcaro: prometheus runs on bookworm in production FWIW, I did an in-place upgrade a couple of weeks back and all was good [13:53:52] (03PS1) 10Ayounsi: Allow prometheus hosts to reach gnmi port [homer/public] - 10https://gerrit.wikimedia.org/r/1075237 [13:53:57] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]] [13:54:01] T361498: [Refine Refactoring] Detect inactive event streams / Refine datasets using data recency thresholds - https://phabricator.wikimedia.org/T361498 [13:54:02] we still have bullseye on toolforge [13:54:19] is there a version of the exporter available for bullseye somewhere? [13:54:35] checking [13:54:41] (should not be hard to upgrade the OS, but will need some time) [13:55:53] !log otto@deploy1003 otto, joal: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:56:31] (03PS1) 10Joal: Remove labswiki from HDFS imported dumps [puppet] - 10https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) [13:57:17] dcaro: can't find a backported version for bullseye no :( it is an hack though copying just the binary should work [13:59:26] hmmm, hacky, would it be too hard to package the binary only? 
[13:59:36] !log otto@deploy1003 otto, joal: Continuing with sync [13:59:38] Lucas_WMDE: thanks for deploying! [13:59:49] np! [14:00:08] (03CR) 10Milimetric: [C:03+1] Remove labswiki from HDFS imported dumps [puppet] - 10https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: 10Joal) [14:02:48] akosiaris: 38 seconds for canaries for me! [14:03:12] godog: yep, using the 0.23 binary works ok [14:03:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:56] dcaro: yeah, technically it is "just" the libc6 versioned dependency that makes the package uninstallable on bullseye [14:04:01] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]] (duration: 10m 04s) [14:04:06] but that's a lie in practice [14:04:15] T361498: [Refine Refactoring] Detect inactive event streams / Refine datasets using data recency thresholds - https://phabricator.wikimedia.org/T361498 [14:04:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1044.eqiad.wmnet with OS bookworm [14:06:32] ottomata: heya, are you finished with your backports? [14:07:18] (03PS1) 10Majavah: P:toolforge: docker: registry: Add option to redirect domain root [puppet] - 10https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) [14:08:15] (03CR) 10JMeybohm: [V:03+1 C:03+2] prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm) [14:08:23] PROBLEM - Host lists1004 is DOWN: CRITICAL - Host Unreachable (208.80.154.81) [14:08:25] jnuche: ottomata: where do things stand on deployments at the moment? 
I have one patch I need to merge/apply to scale things up ahead of the traffic switchover happening at 15:00 [14:08:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [14:08:41] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4102/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: 10Majavah) [14:08:49] libc6 uses versioned symbols, if it has the depende on a the specific libc6 that typically means that it use some feature which isn't available in the older libc (usually openat* etc.) [14:08:52] this should be fairly quick (helmfile apply to a small subset of mediawiki deployments) [14:09:07] swfrench-wmf: I can hold off for you [14:09:08] that usually means we'd actually need to rebuild on the older OS/glibc [14:09:44] moritzm: so you would expect it to crash eventually right? (whenever it tries to use those missing symbols) [14:09:55] RECOVERY - Host lists1004 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:10:01] or not even start at all, yes [14:10:23] jnuche: that would be great - thank you! I'll merge my patch now and give you a heads up when I'm clear. [14:10:23] but worth a shot, depends on what features the go codebases uses [14:11:22] it started and it seems to be running ok [14:12:19] yeah I forget now how to check which actual symbols are implicated in the dependency, but anyways yes we're talking hacks [14:13:05] (03CR) 10Btullis: "Thanks. I did initially set out to try to extend the file type, rather than create a custom type, but I ended up backtracking on it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [14:13:39] (03PS2) 10Scott French: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) [14:13:50] (03PS1) 10JMeybohm: Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - 10https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) [14:14:01] (03CR) 10CI reject: [V:04-1] Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - 10https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm) [14:15:45] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:15:54] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm [14:16:49] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:17:41] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: 10Majavah) [14:17:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:18:10] (03PS2) 10JHathaway: puppetdb: Move JVM config out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:18:16] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:18:41] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply 
[14:18:55] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:19:00] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:19:04] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: docker: registry: Add option to redirect domain root [puppet] - 10https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: 10Majavah) [14:19:23] !log scaled up mw-api-ext ahead of traffic+services switchover - T370962 [14:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:27] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [14:20:00] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:20:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1044.eqiad.wmnet with reason: host reimage [14:20:31] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:21:13] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:21:30] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:21:47] !log scaled up mw-web ahead of traffic+services switchover - T370962 [14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1044.eqiad.wmnet with reason: host reimage [14:24:17] !log dcaro@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [14:26:07] jnuche: I should be out of your way now - thanks for your patience! 
be aware that since this makes the deployments larger, it _might_ also have the effect of slowing things down a bit (should not be large, though, given the amount of spare capacity we have in both k8s clusters). [14:26:43] (03PS2) 10JMeybohm: Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - 10https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) [14:27:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:27:08] swfrench-wmf: hmm, that's bad luck since we were struggling with timeouts, hopefully it will still be ok [14:27:25] however it's probably too late now for the train prsync before we hit the infra window in 30m [14:27:30] (03PS1) 10Kevin Bazira: ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387) [14:27:31] dcaro: can I help with blackbox-exporter upgrade on bullseye ? [14:27:33] I'll try again after that [14:28:26] swfrench-wmf: will you let me know after the DC switchover is complete? [14:28:54] godog: maybe, the process is quite manual though, create a new prometheus VM, move the volume and make sure everything works, you need to be a toolforge root though [14:29:01] T375523 [14:29:02] T375523: [toolforge-prometheus] upgrade to bookworm - https://phabricator.wikimedia.org/T375523 [14:29:45] dcaro: is in-place upgrade to bookworm on the table for this specific problem/issue ? [14:29:53] jnuche: yes, I'll post here when things are stable [14:30:03] swfrench-wmf: thx! 
[14:30:18] godog: nah, not worth it I think, I've manually deployed the new binaries, I'll keep an eye see if they crash but it seems to be working ok [14:30:39] dcaro: ok SGTM, good enough of a bandaid [14:30:42] the thing is that VM-wise, in-place upgrades make tracking the OS of the VM complicated and such [14:31:01] sorry about the blindside there, didn't realize grpc is bookworm-only [14:31:20] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [14:32:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:33:58] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [14:37:06] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:38:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:32] RECOVERY - Host mc2038 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [14:47:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:47:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1044.eqiad.wmnet with OS bookworm [14:50:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm [14:53:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm [14:53:57] dcaro: thanks! [14:56:13] (03CR) 10Ottomata: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [15:00:05] swfrench-wmf: May I have your attention please! Southward Datacenter Switchover: Services + Traffic. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500). [15:00:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:11] !log starting switchover day 1 - T370962 [15:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:16] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:01:52] !log swfrench@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: Datacenter Switchover, T370962] [15:02:12] !log swfrench@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: Datacenter Switchover, T370962] [15:04:12] going to monitor how things progress over the next 15-20m before moving on to the next step [15:04:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:39] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync [15:09:02] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync [15:09:25] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [15:09:48] (03CR) 10Brouberol: [C:03+1] "Oops, sorry, that slipped through the cracks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [15:13:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [15:15:58] (03PS1) 10JHathaway: puppetboard: use stdlib version of to_python [puppet] - 10https://gerrit.wikimedia.org/r/1075254 [15:16:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075254 (owner: 10JHathaway) [15:22:11] things are looking good, and I'm going to proceed with depooling discovery services from eqiad [15:22:23] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:22:26] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:23:42] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T370962 [15:23:48] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:26:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075254 (owner: 10JHathaway) [15:27:54] (03CR) 10JHathaway: [C:03+2] puppetboard: use stdlib version of to_python [puppet] - 10https://gerrit.wikimedia.org/r/1075254 (owner: 10JHathaway) [15:29:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host dse-k8s-worker1004.eqiad.wmnet with OS bookworm [15:32:07] (03PS1) 10JHathaway: wmflib::to_python: remove [puppet] - 10https://gerrit.wikimedia.org/r/1075258 [15:33:10] (03CR) 10JHathaway: [C:03+2] wmflib::to_python: remove [puppet] - 10https://gerrit.wikimedia.org/r/1075258 (owner: 10JHathaway) [15:34:51] swfrench-wmf: i was about to do a rolling restart of eventgate-analytlics to pick up a config change. should I not? [15:38:50] i think the rolling restart shouldn't affect the traffic routing and switchover stuff, so i'm going to proceed [15:38:56] ottomata: if it's a fairly low risk change, I don't see any reason why not. be aware that the service is now depooled from eqiad and serving solely from codfw - i.e., to note while you're verifying things (and also bear in mind that codfw is serving all traffic now). [15:38:57] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10172078 (10Krinkle) [15:39:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10172072 (10Papaul) 05Open→03Resolved This is done [15:39:09] swfrench-wmf: great thank you [15:39:25] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [15:39:43] !log rolling restart of eventgate-analytics to pick up https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073855 [15:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:58] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [15:40:06] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [15:40:55] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [15:41:00] !log otto@deploy1003 helmfile [staging] START 
helmfile.d/services/eventgate-analytics: sync [15:41:08] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [15:43:23] (03PS3) 10Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 [15:44:50] (03PS3) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [15:45:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10172089 (10Papaul) 05Open→03Resolved a:03Papaul This is also done we can resolve. [15:48:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#10172117 (10Papaul) 05Open→03Resolved a:03Papaul This is done we are tracking the decom in https://phabricator.wikimedia.org/T375419 and https://... 
[15:48:15] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10172128 (10Papaul) 05Open→03Resolved a:03Papaul This is done [15:49:12] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover - T370962 [15:49:18] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:50:12] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10172139 (10Papaul) [15:50:17] !log switchover day 1 actions are complete - T370962 [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:23] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:46] hmmm ... thumbor [15:54:01] It might require more replicas, I think we had that in some past switchover [15:54:35] (03PS1) 10Mforns: Correct port number for data-gateway in commons impact analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035) [15:54:41] godog: np, stuff happens :), XioNoX yw! [15:55:01] akosiaris: ah, that's plausible, I was aware of the issue with the discovery record last time around, but not capacity [15:57:23] Hi all! I'd like to deploy a fix for a minor issue (missing vary header) with some REST endpoints in the next hour. Any concerns? [15:57:44] swfrench-wmf: I was told you are fiddling with things, would a backport deployment be ok with you? 
[15:58:21] jouncebot nowandnext [15:58:21] For the next 0 hour(s) and 1 minute(s): Southward Datacenter Switchover: Services + Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500) [15:58:21] For the next 0 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500) [15:58:21] In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1600) [15:58:30] duesen: of you could hold off for now, that would be greatly appreciatd - we're still troubleshooting an issue [15:58:34] *if you' [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:07] (nothing in today's puppet window, the conch is still swfrench-wmf's) [16:00:28] swfrench-wmf, duesen: once things are clear, o [16:00:49] er, sorry - once things are clear, i'm needing to run a train presync. fine to have a backport go out first though. [16:00:55] (03CR) 10Santiago Faci: [C:03+2] Correct port number for data-gateway in commons impact analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [16:01:29] (again, once things are clear - no pressure on my end.) [16:02:02] (03Merged) 10jenkins-bot: Correct port number for data-gateway in commons impact analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [16:02:19] brennen: I'm not ready yet, the patch needs to pass CI first. I hope to be ready in about 40 minutes. 
[16:03:39] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10172197 (10jhathaway) >>! In T370786#10169691, @Eevans wrote: > We need to make sure Corto behaves well in a channels with existing bots, so the command namespace needs to be unique; Corto should... [16:04:17] brennen: ack, thanks - I'll let you know [16:06:00] !log swfrench@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad [16:06:59] !log repooled swift-ro in eqiad to potentially mitigate issues with thumbor - T370962 [16:07:04] (03PS4) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [16:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:05] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [16:07:09] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [16:07:26] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [16:08:15] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:09:55] hello [16:09:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes 
(http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:59] !incidents [16:09:59] 5276 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:09:59] 5277 (UNACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [16:10:00] 5274 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [16:10:00] 5273 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [16:10:00] 5272 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [16:10:00] 5271 (RESOLVED) db1246 (paged)/mysqld processes (paged) [16:10:03] !ack 5276 [16:10:03] 5276 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:10:04] !ack 5277 [16:10:04] 5277 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [16:10:12] Here. [16:10:27] swift-ro was repooled in eqiad right? [16:10:37] denisse: ACKed them both [16:10:38] sukhe: yes, exactly [16:10:43] sukhe: yes, that could explain the page. 
[16:12:50] !log Deploying Refinery [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:10] hmm that doesn't seem to be getting better [16:13:15] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:26] I suspect we might need to pool the `swift` dnsdisc also [16:13:31] (03CR) 10Brouberol: [C:03+1] rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: 10DCausse) [16:13:51] that was done it seems? [16:13:52] set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad [16:14:17] (03PS1) 10Daniel Kinzler: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) [16:14:23] alright, so the current status is as follows: errors started a bit after 15:40, which is a bit after we depooled swift-ro in eqiad [16:14:35] swfrench-wmf: yeah matches exactly [16:14:39] !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda]: Regular analytics weekly train [analytics/refinery@cdcefda6] [16:14:45] sukhe: there's a dnsdisc service for `swift` with no -ro or -rw also [16:14:54] hnowlan: ah [16:15:05] <_joe_> I would guess that's the one used [16:15:25] bah, it'll hit that one now [16:15:36] swfrench-wmf: ok thanks [16:15:40] *I'll repool that in eqiad now [16:15:45] <_joe_> also remember there's a 5 minutes delay if you're not wiping caches [16:16:21] we can wipe them but I will hold off [16:16:35] !log swfrench@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [16:16:39] !log swfrench@cumin1002 conftool action : 
set/pooled=true; selector: dnsdisc=swift,name=eqiad [16:17:09] I hit enter too quickly, but yes, should be re-pooling now, modulo the delay _joe_ mentioned [16:17:29] seeing an increase in thumbor loads in eqiad already [16:17:42] <_joe_> you can check the ats config and see it mentions swift.discovery.wmnet, indeed [16:17:45] errors going down as well [16:17:53] 10ops-codfw, 06SRE, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172257 (10Jhancock.wm) a:05Papaul→03Jhancock.wm so I swapped the CPUs, and it boots. but the idrac won't connect. when I connect a front plate the ip shows as 0... [16:18:15] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:28] !log repooled swift in eqiad to potentially mitigate issues with thumbor and swift - T370962 [16:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:34] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [16:18:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [16:19:14] (03PS2) 10Daniel Kinzler: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) [16:19:20] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398#10172251 (10Jhancock.wm) a:03Jhancock.wm wrong ticket [16:19:32] that's looking a lot better [16:19:41] yep [16:19:51] RESOLVED: ATSBackendErrorsHigh: ATS: 
elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:19:52] so probably a note on the swift vs swift-ro,rw distinction [16:19:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:53] sukhe: yeah, I can investigate the history a bit more today and document and / or add the service to the exclusion list [16:22:25] !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda]: Regular analytics weekly train [analytics/refinery@cdcefda6] (duration: 07m 46s) [16:22:45] !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda] (thin): Regular analytics weekly train THIN [analytics/refinery@cdcefda6] [16:23:33] FIRING: KubernetesCalicoDown: mw2427.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2427.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:24:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [16:26:42] (03PS2) 10RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1075098 [16:27:20] !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda] (thin): Regular analytics weekly train THIN [analytics/refinery@cdcefda6] 
(duration: 04m 34s) [16:27:37] !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cdcefda6] [16:28:01] denisse: sukhe: alright, I believe the dust has settled here. thank you both for your assistance and patience with this bit of switchover fallout [16:28:13] <3 [16:28:25] (03CR) 10Ottomata: [C:03+2] Remove labswiki from HDFS imported dumps [puppet] - 10https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: 10Joal) [16:28:33] swfrench-wmf: ACK, thank you! :) [16:29:29] (03CR) 10RLazarus: deployment_server: mwscript_k8s usability tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus) [16:30:48] (03CR) 10Aaron Schulz: [C:03+2] REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: 10Daniel Kinzler) [16:30:50] !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cdcefda6] (duration: 03m 13s) [16:31:17] swfrench-wmf: clear for some train operations? [16:31:42] brennen: was just about to say, heh - yes, at this point, I think you should be good to go [16:31:47] thanks! [16:33:45] logstash.wikimedia.org seems broken to me. Our team's dashboard has disappeared and the home page seems to be an outdated version. [16:34:14] When I view logstash.wikimedia.org I see "AHT Team" which was removed a while ago [16:34:23] And no "Trust and Safety Product" [16:35:06] Plus when I open our team's dashboard from a link we have saved in a google doc, there is an error saying the dashboard does not exist. 
The URL we have saved is https://logstash.wikimedia.org/app/dashboards#/view/bc0caa20-92d5-11ee-b8fa-893e52d5cd7d?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-1w%2Cto%3Anow)) [16:35:24] that's no good - it appears to be a product of the failover to codfw [16:35:38] * cwhite looks into rolling that back [16:35:47] swfrench-wmf: ^^ [16:36:01] (03PS5) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [16:36:09] ah, interesting! [16:36:25] sounds like it might make sense to switch logstash back to eqiad [16:36:36] cwhite: SGTY? [16:36:43] yes, please :) [16:36:45] 10ops-codfw, 06SRE, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172303 (10Jhancock.wm) provisioning script failed. it can't get a connection to the server. set the idrac manually, but still doesn't ping. This is still under warr... [16:37:54] !log swfrench@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=logstash,name=eqiad [16:38:06] !log swfrench@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=logstash,name=codfw [16:38:58] !log switched logstash.discovery.wmnet back to eqiad due to reports of stale dashboards - T370962 [16:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:08] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [16:39:50] Do you know how long the cache might last for switching back to eqiad? [16:39:56] (03CR) 10Volans: "LGTM, non voting just because I don't have the specific context for the change." [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus) [16:40:27] Dreamy_Jazz: give your browser a hard refresh and let us know if the issue is corrected? 
[16:40:40] Dreamy_Jazz: it should happen with 5m, but I can clear the recursor caches right now [16:40:53] It's fixed for me now. Thanks. [16:41:13] Thank you swfrench-wmf! [16:41:18] !log brennen@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 [16:41:25] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [16:41:26] glad to hear that did it [16:41:36] cwhite: no problem at all, and thanks for flagging! [16:42:20] <3 [16:48:00] swfrench-wmf, brennen: CI is in a bad mood, we'll do it later in the day [16:49:38] 06SRE-OnFire, 10Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10172328 (10jhathaway) I think keeping everything has private is the best path for the MVP. [16:50:09] duesen: ack [16:51:39] (03CR) 10CI reject: [V:04-1] REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: 10Daniel Kinzler) [16:52:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10172323 (10ovasileva) p:05High→03Medium Potential next steps in addition to the tickets above is QTE preparation and test cases. [16:53:30] (03CR) 10Btullis: "I have a question though. Why can't we just enable dumps of labswiki, now that it has been moved to the core DB servers?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: 10Joal) [16:54:22] !log Deployed refinery using scap, then deployed onto hdfs [16:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:57:53] !log sfaci@deploy1003 Started deploy [airflow-dags/analytics@e1fb17b]: deploying new datahub ingestion [16:58:30] !log sfaci@deploy1003 Finished deploy [airflow-dags/analytics@e1fb17b]: deploying new datahub ingestion (duration: 00m 52s) [16:59:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1700) [17:01:13] (03CR) 10JHathaway: [C:03+2] vrts_aliases: add a basic safeguard [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [17:02:14] (03Merged) 10jenkins-bot: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: 10Daniel Kinzler) [17:02:26] (03PS2) 10JHathaway: vrts_aliases: add a basic safeguard [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) [17:03:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [17:06:03] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install 
logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10172392 (10Jhancock.wm) @colewhite I might need a moment with these servers. getting errors on prometheus. T375328 [17:07:31] 10ops-codfw, 06SRE, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172394 (10Jhancock.wm) @jijiki papaul helped me get it singable and the system is back online. The parts have been swapped and we might need to test under full load... [17:09:03] !log brennen@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.24 refs T373643 (duration: 27m 45s) [17:09:10] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [17:10:35] !log train presync finished successfully; going AFK for ~45 minutes but will return to roll group0 during train window (T373643, T375477) [17:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] T375477: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477 [17:11:28] (03CR) 10Hokwelum: [C:03+1] Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 (owner: 10Krinkle) [17:17:34] (03CR) 10Btullis: Add an hdfs_file type and provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [17:19:16] (03CR) 10Bartosz Dziewoński: [C:04-1] "I tested this on the beta cluster and it doesn't work. I don't know why, I will try debugging it some other time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [17:21:50] (03PS1) 10Bking: admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) [17:34:54] (03PS19) 10Ssingh: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:37:52] (03PS1) 10Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) [17:39:42] (03CR) 10Ottomata: Add an hdfs_file type and provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [17:40:36] (03CR) 10Ottomata: Add an hdfs_file type and provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [17:47:47] (03PS1) 10Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 [17:52:12] (03CR) 10Jforrester: [C:04-1] "Let's not regress portals." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 (owner: 10Reedy) [17:52:25] (03PS2) 10Jforrester: MetaContactPages: Minor comment tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 (owner: 10Reedy) [17:52:31] lol [18:00:05] brennen and jnuche: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1800). [18:01:20] swfrench-wmf, brennen: Can I go ahead with the backport now? The train already rolled to group0 earlier today, right? I could also wait for the backport window, but it's getting late here... 
[18:02:46] duesen: no objections on my end, but I'd probably defer to brennen here [18:03:08] duesen: the train is not rolled out to group0 and brennen is actually planning to do it during this window [18:03:25] RESOLVED: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:03:37] Amir1: ah ok, thanks. Let's hop on our sync call, then :) [18:03:54] (03PS3) 10Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 [18:05:00] (03CR) 10CI reject: [V:04-1] MetaContactPages: Minor comment tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 (owner: 10Reedy) [18:05:26] (03PS4) 10Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075280 [18:05:37] brennen: Amir just reminded me that the patch will actually ride the train since it's already merged into wmf.24. 
Here it is: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1075269 [18:06:03] (03PS1) 10Cwhite: es-exporter: add wikifunctions queries [puppet] - 10https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) [18:06:18] (03PS8) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [18:06:34] (03CR) 10CI reject: [V:04-1] es-exporter: add wikifunctions queries [puppet] - 10https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) (owner: 10Cwhite) [18:07:11] (03CR) 10BCornwall: varnish: Occasional RSA cert connection warnings (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:08:56] (03PS2) 10Cwhite: es-exporter: add wikifunctions queries [puppet] - 10https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) [18:09:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:01] o/ [18:10:48] (03PS3) 10Cwhite: es-exporter: add wikifunctions queries [puppet] - 10https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) [18:10:53] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643) [18:10:54] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [18:11:03] going ahead to group0. 
[18:11:38] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [18:14:51] (03CR) 10Jdlrobson: [C:03+1] Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [18:21:34] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.24 refs T373643 [18:21:41] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [18:35:08] (03PS3) 10RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1075098 [18:42:35] (03PS9) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [18:44:26] (03PS1) 10BCornwall: varnish: Cast test resources to str [puppet] - 10https://gerrit.wikimedia.org/r/1075300 [18:48:33] (03CR) 10Scott French: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus) [18:49:20] (03CR) 10Nik Gkountas: [C:03+1] ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [18:50:29] (03CR) 10RLazarus: [C:03+2] "Thanks both!" [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus) [18:53:52] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet [19:00:54] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet [19:04:43] (03PS10) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:05:19] (03CR) 10Ssingh: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075300 (owner: 10BCornwall) [19:14:46] (03PS11) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:17:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10172825 (10andrea.denisse) 05Open→03Resolved Hi, I've resynced the drive and it's now part of our RAID array: ` denisse@prometheus1008:~$ sudo mdadm --de... [19:19:05] (03CR) 10Joal: "One question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [19:31:17] (03CR) 10JHathaway: [C:03+1] puppetdb: Move JVM config out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [19:40:13] (03CR) 10Cwhite: [C:03+2] es-exporter: add wikifunctions queries [puppet] - 10https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) (owner: 10Cwhite) [19:42:13] (03PS2) 10Samtar: Set `wgMFCustomSiteModules` to false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: 10Steven Rawson) [19:48:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: 10Steven Rawson) [19:48:38] TheresNoTime, ^ [19:49:26] ta! Should probably stop laying on the bed and move towards a laptop [19:49:47] priorities [19:50:04] Do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug installed? 
[19:50:11] just did yes [19:53:23] may as well start now [19:53:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: 10Steven Rawson) [19:54:22] (03Merged) 10jenkins-bot: Set `wgMFCustomSiteModules` to false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: 10Steven Rawson) [19:55:23] (03PS1) 10C. Scott Ananian: Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) [19:55:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Scribunto] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: 10C. Scott Ananian) [19:56:38] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]] [19:56:45] T375540: Set wgMFCustomSiteModules to false for mediawikiwiki - https://phabricator.wikimedia.org/T375540 [19:58:37] !log samtar@deploy1003 samtar, izno: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:58:59] Izno: your patch is now ready for testing — you can use WikimediaDebug to pick any of the test servers and check if the change works as expected on mediawikiwiki [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T2000) [20:00:05] Krinkle, toyofuku, derenrich, Izno, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] o/ [20:00:24] hellooo [20:00:24] Izno: given the type of patch, I'm not sure if its actually testable.. [20:00:37] I think I can test it, I just don't know what exactly I'm doing :) [20:02:12] (03PS1) 10CDanis: coredns: add support for Service externalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075311 (https://phabricator.wikimedia.org/T344171) [20:02:36] Izno: so if you go to, e.g. https://www.mediawiki.org/wiki/MediaWiki:Common.css, and in your browser you should see the WikimediaDebug extension icon (a wikitech globe iirc) — click that, select k8s-mwdebug, and then click the toggle [20:03:04] toggle clicked [20:03:30] if you refresh that tab, you're now using a test server where your config patch has been applied [20:03:47] ok, cool [20:04:27] (make sure you toggle that off when you're done) — as for testing your config patch, I'm not sure how you'd really do that [20:04:47] test here is to add something to Common.css, see if it pops through in console or not [20:04:51] while on the mobile domain [20:05:22] ack — let me know when you're confident your patch works, and then I'll continue the sync :) [20:07:11] (just fyi this is my second attempt at my first patch, so not exactly sure what's supposed to be happening here) [20:08:18] derenrich: just handling Izno's patch as I began that one early, then I'll be working through the list at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T2000 [20:08:30] cool [20:09:29] brennen: With the train landing on group0, I would have expected this backport to be live now: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1075269. 
But that doesn't seem to be the case. At least, I can't observe the changed Vary header in the response. Can you confirm that this patch is in the history of the version that's deployed? Am I missing something? [20:10:00] i'm here [20:10:11] TheresNoTime, it's functional [20:10:38] duesen: that wasn't deployed I believe — I got a warning that it was on the deployment server but not sync'd [20:10:42] duesen: on gerrit there's an "included in" drop down [20:10:58] duesen: it'll be sync'd once I finished Izno's patch afaik [20:10:59] cscott: oh nice, thanks! [20:11:00] Izno: ack [20:11:03] !log samtar@deploy1003 samtar, izno: Continuing with sync [20:11:25] TheresNoTime: ah, thanks! I was afraid that this would happen. Sorry for the mess. [20:11:32] Under the 'three dots' menu at the top right https://usercontent.irccloud-cdn.com/file/GiAt4BS1/image.png [20:12:23] cscott: that's neat, but not useful in this case - since it actually *is* on the branch, but didn't get deployed :) [20:12:25] ^ but as TheresNoTime says, just because it is on the branch doesn't *necessarily* mean that it was deployed :) [20:12:32] yeah. [20:12:46] anyway, just sharing some useful gerrit trivia really [20:15:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [20:15:50] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]] (duration: 19m 11s) [20:15:56] T375540: Set wgMFCustomSiteModules to false for mediawikiwiki - https://phabricator.wikimedia.org/T375540 [20:15:59] Izno: live on production now :) [20:16:05] :D [20:16:08] duesen: yours should be too, can you check its working as expected? 
[20:16:53] (03PS4) 10Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 [20:17:02] Krinkle: around for your deployment? [20:18:08] (03PS2) 10Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) [20:18:32] toyofuku: will move to yours next, ready? :) [20:18:37] yep! [20:19:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:20:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [20:20:55] (03Merged) 10jenkins-bot: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:21:17] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]] [20:21:23] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:23:19] !log samtar@deploy1003 samtar, toyofuku: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:33] toyofuku: ready for testing — is this something you can test? [20:23:41] Yep! Testing now [20:23:59] Gonna be a couple minutes as I want to be thorough, but I'll go quick 🤞 [20:24:35] no problem! just ping me when you're ready :) [20:27:01] TheresNoTime: we're looking good, thank you! [20:27:02] TheresNoTime: yes, it works! 
[20:27:16] !log samtar@deploy1003 samtar, toyofuku: Continuing with sync [20:27:31] ack :) [20:32:29] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]] (duration: 11m 11s) [20:32:35] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:32:53] toyofuku: that's live :) [20:33:06] (03PS2) 10DErenrich: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 [20:33:28] derenrich: ready for your patch? Did you say this was a retry? [20:33:46] ready. and it never merged due to some issue unrelated to my patch [20:33:47] (03PS2) 10Bking: admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) [20:33:47] ty!!! [20:34:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich) [20:34:56] (03Merged) 10jenkins-bot: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich) [20:35:16] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]] [20:35:21] does it matter which server i point wikimedia debug to? [20:35:38] derenrich: no, any will work [20:37:15] !log samtar@deploy1003 derenrich, samtar: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:27] derenrich: ready for testing now [20:37:36] it's working! 
[20:37:45] cscott: I'm going to set your patch merging now [20:37:48] !log samtar@deploy1003 derenrich, samtar: Continuing with sync [20:37:52] ok, thanks! [20:37:55] * Krinkle is around albeit late [20:37:58] (03CR) 10Samtar: [C:03+2] Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: 10C. Scott Ananian) [20:38:40] TheresNoTime: happy to tag join still if possible, otherwise I might roll it out later today [20:39:06] Krinkle: no worries, will probably be able to get yours out while ^ merges [20:39:26] (03PS5) 10Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 [20:39:49] thanks for helping! i assume i'm good to leave? [20:40:36] derenrich: Ideally you'd test it again in prod once its finished syncing (a couple more minutes) [20:40:43] ok i can wait [20:41:54] https://test.wikipedia.org/wiki/T375539 is my test page once the backport deploys [20:42:22] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]] (duration: 07m 05s) [20:42:29] derenrich: live on prod now [20:42:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 (owner: 10Krinkle) [20:42:47] yup working [20:42:50] thanks so much [20:42:54] No problem! :) [20:43:04] thx [20:43:29] (03Merged) 10jenkins-bot: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 (owner: 10Krinkle) [20:43:39] cscott: I left starting your merge a little late, sorry — still going to be another 20m or so, is that okay? 
[20:43:51] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]] [20:44:08] yeah i can deal [20:44:20] zuul is very slow today [20:45:02] there's a couple of jobs running for 1-2hrs :/ [20:45:12] (03PS1) 10Krinkle: labs: Remove unused wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075313 [20:45:43] !log samtar@deploy1003 samtar, krinkle: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:48] there was a job for Experiment:GrowthExperiments which failed one of its parallel tests and then ran for an additional *hour* running the other branches of the parallel test suite despite the entire build being doomed [20:45:48] Krinkle: are you going to want to test your patch, or is "nothing breaking" enough? :D [20:46:16] I'll run it through mwdebug to check for any surprise PHP warnings but other than that no [20:46:39] ack, lemme know when I can sync :) [20:47:17] LGTM, go ahead [20:47:25] !log samtar@deploy1003 samtar, krinkle: Continuing with sync [20:47:50] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Jasonb28' 'MichiganNJPat' # T375516 [20:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:56] T375516: Unblock stuck global rename of Jasonb28 - https://phabricator.wikimedia.org/T375516 [20:49:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10173104 (10Jclark-ctr) 05Open→03Resolved [20:52:04] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]] (duration: 08m 13s) [20:52:33] Krinkle: want me to do the beta-only 1075313 while I'm 
waiting? [20:52:49] sure, feel free to merge. [20:53:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075313 (owner: 10Krinkle) [20:54:15] (03Merged) 10jenkins-bot: labs: Remove unused wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075313 (owner: 10Krinkle) [20:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:58:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10173118 (10Jclark-ctr) @jijiki please do update to preseed.yaml, and site.pp when you can we have received these and can not move forward until that step is completed. [21:00:22] TheresNoTime: subbu is going to step in to verify the backport if I'm AFK when jenkins is done [21:00:48] cscott: no worries, there's about 8m left for the merge fwiw [21:01:27] (i am here) [21:03:52] (03Merged) 10jenkins-bot: Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: 10C. 
Scott Ananian) [21:04:22] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]] [21:04:29] T375539: Scribunto generates duplicate IDs when there are errors on fragments included more than once on a page - https://phabricator.wikimedia.org/T375539 [21:06:22] !log samtar@deploy1003 samtar, cscott: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:34] cscott / subbu ^ [21:07:37] ack [21:08:20] 10SRE-swift-storage, 13Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10173141 (10Scott_French) @MatthewVernon - FYI, while reviewing the logs from the first part of the switchover earlier today, I noticed that `apus` is depooled everywhere,... [21:09:09] TheresNoTime, lgtm. (/cc cscott) [21:09:21] !log samtar@deploy1003 samtar, cscott: Continuing with sync [21:11:38] (03PS1) 10Scott French: sre.discovery.datacenter: exclude kibana7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) [21:11:38] (03CR) 10Scott French: "Given the discussion in T375544, I think this seems like the most sensible way forward for now. Thanks in advance for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: 10Scott French) [21:13:08] Thanks subbu! 
[21:13:58] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]] (duration: 09m 36s) [21:14:05] T375539: Scribunto generates duplicate IDs when there are errors on fragments included more than once on a page - https://phabricator.wikimedia.org/T375539 [21:15:02] !log UTC late backport window done [21:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:11] (03PS4) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) [21:15:46] (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [21:21:22] (03PS1) 10Zabe: Initial configuration for madwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968) [21:23:06] (03CR) 10Zabe: [C:03+2] Initial configuration for madwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968) (owner: 10Zabe) [21:24:04] (03Merged) 10jenkins-bot: Initial configuration for madwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968) (owner: 10Zabe) [21:25:15] !log create Wiktionary Madurese # T374968 [21:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:21] T374968: Create Wiktionary Madurese - https://phabricator.wikimedia.org/T374968 [21:25:44] !log zabe@deploy1003 Started scap sync-world: Creating madwiktionary (T374968) [21:27:28] (03PS1) 10Zabe: Initial configuration for kgewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813) [21:29:28] (03CR) 10Zabe: [C:03+2] Initial 
configuration for kgewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813) (owner: 10Zabe) [21:30:31] (03Merged) 10jenkins-bot: Initial configuration for kgewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813) (owner: 10Zabe) [21:30:44] (03CR) 10Cwhite: [C:03+1] "Thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: 10Scott French) [21:32:34] !log zabe@deploy1003 Finished scap sync-world: Creating madwiktionary (T374968) (duration: 06m 50s) [21:32:40] T374968: Create Wiktionary Madurese - https://phabricator.wikimedia.org/T374968 [21:32:47] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=madwiktionary --cluster=all 2>&1 | tee /tmp/madwiktionary.UpdateSearchIndexConfig.log # T374968 [21:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:18] !log zabe@deploy1003 Started scap sync-world: Creating kgewiki (T374813) [21:34:24] T374813: Create Wikipedia Komering - https://phabricator.wikimedia.org/T374813 [21:35:47] (03CR) 10BCornwall: [C:03+2] varnish: Cast test resources to str [puppet] - 10https://gerrit.wikimedia.org/r/1075300 (owner: 10BCornwall) [21:40:54] !log zabe@deploy1003 Finished scap sync-world: Creating kgewiki (T374813) (duration: 06m 35s) [21:41:01] T374813: Create Wikipedia Komering - https://phabricator.wikimedia.org/T374813 [21:41:28] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=kgewiki --cluster=all 2>&1 | tee /tmp/kgewiki.UpdateSearchIndexConfig.log # T374813 [21:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:15] (03PS1) 10Zabe: Initial configuration for moswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641) [21:48:40] 
(03PS1) 10Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - 10https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) [21:48:58] (03CR) 10Zabe: [C:03+2] Initial configuration for moswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641) (owner: 10Zabe) [21:49:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [21:49:41] (03Merged) 10jenkins-bot: Initial configuration for moswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641) (owner: 10Zabe) [21:50:28] !log zabe@deploy1003 Started scap sync-world: Creating moswiki (T374641) [21:50:34] T374641: Create Wikipedia Mooré - https://phabricator.wikimedia.org/T374641 [21:56:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10173348 (10Papaul) [21:57:17] !log zabe@deploy1003 Finished scap sync-world: Creating moswiki (T374641) (duration: 06m 49s) [21:57:24] T374641: Create Wikipedia Mooré - https://phabricator.wikimedia.org/T374641 [21:57:51] (03PS1) 10Zabe: Initial configuration for gorwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088) [21:58:16] zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=moswiki --cluster=all 2>&1 | tee /tmp/moswiki.UpdateSearchIndexConfig.log # T374641 [21:58:21] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=moswiki --cluster=all 2>&1 | tee /tmp/moswiki.UpdateSearchIndexConfig.log # T374641 [21:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:36] (03PS2) 10Bking: airflow: allow traffic to webserver port from dse-k8s pods 
[puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948)
[21:58:45] (CR) Zabe: [C:+2] Initial configuration for gorwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088) (owner: Zabe)
[21:58:47] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[21:59:30] (Merged) jenkins-bot: Initial configuration for gorwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088) (owner: Zabe)
[22:00:06] !log zabe@deploy1003 Started scap sync-world: Creating gorwikiquote (T375088)
[22:00:23] T375088: Create Wikiquote Gorontalo - https://phabricator.wikimedia.org/T375088
[22:03:11] bye wikibugs
[22:06:58] (CR) Zabe: [C:+2] Initial configuration for shnwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1075323 (https://phabricator.wikimedia.org/T375430) (owner: Zabe)
[22:07:02] !log zabe@deploy1003 Finished scap sync-world: Creating gorwikiquote (T375088) (duration: 06m 56s)
[22:07:08] T375088: Create Wikiquote Gorontalo - https://phabricator.wikimedia.org/T375088
[22:07:23] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=gorwikiquote --cluster=all 2>&1 | tee /tmp/gorwikiquote.UpdateSearchIndexConfig.log # T375088
[22:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:43] (Merged) jenkins-bot: Initial configuration for shnwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1075323 (https://phabricator.wikimedia.org/T375430) (owner: Zabe)
[22:08:43] !log zabe@deploy1003 Started scap sync-world: Creating shnwikinews (T375430)
[22:08:49] T375430: Create Wikinews Shan - https://phabricator.wikimedia.org/T375430
[22:09:40] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:42] (PS1) BCornwall: Remove rsa-2048 certs from services [puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569)
[22:15:05] (CR) BCornwall: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[22:15:30] !log zabe@deploy1003 Finished scap sync-world: Creating shnwikinews (T375430) (duration: 06m 47s)
[22:15:34] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=shnwikinews --cluster=all 2>&1 | tee /tmp/shnwikinews.UpdateSearchIndexConfig.log # T375430
[22:15:37] T375430: Create Wikinews Shan - https://phabricator.wikimedia.org/T375430
[22:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:43] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327
[22:16:44] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327 (owner: Zabe)
[22:17:24] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327 (owner: Zabe)
[22:20:31] !log zabe@deploy1003 Started scap sync-world: update interwiki cache
[22:27:23] !log zabe@deploy1003 Finished scap sync-world: update interwiki cache (duration: 06m 52s)
[22:33:39] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=elwiki # T363538
[22:33:42] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=cawiki # T363538
[22:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:45] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[22:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:16] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=eowikinews # T363538
[22:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:31] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=foundationwiki # T363538
[22:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:50] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=gurwiki # T363538
[22:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:06] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=metawiki # T363538
[22:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:44] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=orwiki # T363538
[22:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:53] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[22:42:03] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=pawiki # T363538
[22:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:16] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=testwiki # T363538
[22:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:35] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=tumwiki # T363538
[22:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:50] !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=viwiki # T363538
[22:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:38] (PS1) DErenrich: Bump coverage of the add-a-fact quicksurvey to 0.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075333
[23:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 808.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334
[23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334 (owner: TrainBranchBot)
[23:39:01] (PS1) BPirkle: REST: Adjust REST Sandbox spec for new specs module [mediawiki-config] - https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512)
[23:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 815.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded