[00:01:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[00:02:24] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:03:25] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:03:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1042
[00:04:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1051.eqiad.wmnet with reason: host reimage
[00:05:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1042
[00:05:22] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:08:43] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1051.eqiad.wmnet with reason: host reimage
[00:08:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1041
[00:08:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1041
[00:08:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043
[00:10:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1043
[00:10:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1044
[00:10:42] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075087 (owner: TrainBranchBot)
[00:11:05] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host db1246.eqiad.wmnet
[00:11:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1044
[00:14:37] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:15:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1042.eqiad.wmnet with OS bookworm
[00:15:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1042.eqiad.wmnet with OS bookworm
[00:20:36] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[00:20:44] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm
[00:21:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1043.eqiad.wmnet with OS bookworm
[00:21:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169768 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1043.eqiad.wmnet with OS bookworm
[00:22:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:23:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:23:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1051.eqiad.wmnet with OS bookworm
[00:23:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169769 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1051.eqiad.wmnet with OS bookworm completed:...
[00:28:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:29:23] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1042.eqiad.wmnet with reason: host reimage
[00:32:36] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:32:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1042.eqiad.wmnet with reason: host reimage
[00:35:36] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[00:39:37] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[00:40:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1043.eqiad.wmnet with reason: host reimage
[00:40:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:41:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:43:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1043.eqiad.wmnet with reason: host reimage
[00:46:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:52:10] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:52:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1042.eqiad.wmnet with OS bookworm
[00:52:20] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:52:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169826 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1042.eqiad.wmnet with OS bookworm completed:...
[00:52:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:12] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:56:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169829 (Jclark-ctr)
[00:57:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169834 (Jclark-ctr)
[00:58:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:58:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:58:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1043.eqiad.wmnet with OS bookworm
[00:58:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169837 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1043.eqiad.wmnet with OS bookworm completed:...
[00:59:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169838 (Jclark-ctr)
[01:00:12] <wikibugs>	 ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T375455 (phaultfinder) NEW
[01:00:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169845 (Jclark-ctr)
[01:00:40] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm
[01:00:46] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**)   - Removed fr...
[01:03:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:07:41] <wikibugs>	 (PS1) Papaul: Remove db1246 from using partman/custom/db.cfg end of testing [puppet] - https://gerrit.wikimedia.org/r/1075092 (https://phabricator.wikimedia.org/T374215)
[01:08:02] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643)
[01:08:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[01:08:40] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:08:44] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:08:44] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1052.eqiad.wmnet with OS bookworm
[01:08:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm completed:...
[01:14:16] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10169864 (Jclark-ctr) a:VRiley-WMF
[01:14:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[01:15:16] <wikibugs>	 ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T375455#10169867 (Jclark-ctr) Open→Resolved a:Jclark-ctr Corrected manual ip on new supermicro server
[01:15:22] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:28] <wikibugs>	 (CR) Ssingh: "Sorry it took a while for the review." [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[01:15:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169871 (Jclark-ctr) Open→Resolved a:Jclark-ctr Relocated 3 servers that had not been imaged to new rack
[01:16:54] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:17:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:22] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.541 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:31] <wikibugs>	 (CR) Ssingh: "modules/varnish/templates/browsersec.body.html.erb also should be updated with the text." [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[01:29:26] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:18] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:33:39] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:34:12] <wikibugs>	 (CR) Papaul: [C:+2] Remove db1246 from using partman/custom/db.cfg end of testing [puppet] - https://gerrit.wikimedia.org/r/1075092 (https://phabricator.wikimedia.org/T374215) (owner: Papaul)
[01:35:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.24 [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075093 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[01:42:28] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169909 (Papaul) @ABran-WMF  - For step 1 testing I used sudo cookbook sre.hosts-dhcp --os bullseye db1246 I destroyed the raid10 configuration and set it back up...
[02:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:31:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10169921 (Papaul) Open→Resolved This is complete, thanks @Dwisehaupt  @Jhancock.wm
[02:33:31] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169923 (Papaul)
[02:35:07] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169924 (Papaul)
[02:38:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:04] <icinga-wm>	 PROBLEM - mysqld processes #page on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:40:06] <icinga-wm>	 PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[02:40:11] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 #page on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:40:11] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:40:12] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:46:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10169925 (Papaul)
[02:59:28] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:29] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643)
[03:01:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[03:02:16] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075096 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[03:02:34] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643
[03:02:38] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[03:05:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:06:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server depooled.  Has hardware issues
[03:06:44] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server depooled.  Has hardware issues
[03:06:55] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10169930 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a261b5e0-f6ea-4087-9a5c-74f99c8cbc7e) set by eevans@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their servi...
[03:14:44] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:24:44] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:41:24] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10169937 (jwang) Resolved→Open @ssingh, My manager (@mpopov ) told me I should request for `airflow-analytics-product-admins` group, instead of `deployment...
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0400)
[04:04:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:04] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.21 (duration: 06m 02s)
[04:09:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:56] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:00] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:29:06] <wikibugs>	 (PS1) RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - https://gerrit.wikimedia.org/r/1075098
[04:40:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:41:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:03:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[05:03:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[05:10:37] <wikibugs>	 (CR) Ebrahim: "11 days and no response, is it possible for you to have a look at the namespaces also and to list issues you see so I can at least go for " [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623 (owner: Ebrahim)
[05:13:17] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10169957 (ABran-WMF) >>! In T375382#10168776, @VRiley-WMF wrote: > Hi! We do have a spare DIMM (32 gig, 2666mts) that we can swap at anytime for this unit. Please let us know when is the best...
[05:39:32] <wikibugs>	 (CR) Arnaudb: [C:+1] pc2017: Set it to master [puppet] - https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) (owner: Ladsgroup)
[05:49:32] <wikibugs>	 (Abandoned) Arnaudb: mariadb: productionize db2223 [puppet] - https://gerrit.wikimedia.org/r/1071570 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[05:55:18] <tappof>	 !log centrallog1002 upgrade to bookworm in progress https://phabricator.wikimedia.org/T353912
[05:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:30] <wikibugs>	 ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459 (phaultfinder) NEW
[05:58:16] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10169971 (ABran-WMF) >>! In T375257#10168468, @wiki_willy wrote: > Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr?  Please also keep in mind this server is due to...
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0600).
[06:02:27] <wikibugs>	 ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459#10169972 (phaultfinder)
[06:03:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:28:10] <wikibugs>	 (PS1) Arnaudb: mariadb: productionize db2223 [puppet] - https://gerrit.wikimedia.org/r/1075108 (https://phabricator.wikimedia.org/T373579)
[06:31:28] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:35:42] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet
[06:38:20] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:38:24] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:38:28] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:39:35] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10170007 (ABran-WMF) Amazing! @Papaul thanks for the help!
[06:43:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:45:30] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:51] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: Ebernhardson)
[06:55:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:56:30] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T0700).
[07:00:05] <jouncebot>	 dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:09] <dcausse>	 o/
[07:00:15] <dcausse>	 I can deploy
[07:01:20] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:01:24] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 719, down: 13, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:01:28] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:02:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson)
[07:02:13] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet
[07:02:50] <wikibugs>	 (03Merged) 10jenkins-bot: Add a private variant of the cirrus update stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson)
[07:03:20] <logmsgbot>	 !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]]
[07:03:24] <stashbot>	 T374335: The SUP producer should ship private wiki update events to a separate stream - https://phabricator.wikimedia.org/T374335
[07:07:07] <logmsgbot>	 !log dcausse@deploy1003 dcausse, ebernhardson: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:12:52] <logmsgbot>	 !log dcausse@deploy1003 dcausse, ebernhardson: Continuing with sync
[07:17:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: Occasional RSA cert connection warnings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[07:23:28] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 674, down: 9, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:23:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-1] "Overall very nice job, just one correction." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris)
[07:27:31] <logmsgbot>	 !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073565|Add a private variant of the cirrus update stream (T374335)]] (duration: 24m 11s)
[07:27:35] <stashbot>	 T374335: The SUP producer should ship private wiki update events to a separate stream - https://phabricator.wikimedia.org/T374335
[07:28:14] <wikibugs>	 (03PS1) 10Slyngshede: Dummy Gitlab tokens for IDM. [labs/private] - 10https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820)
[07:28:54] <dcausse>	 !log closing the backport window
[07:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:40] <wikibugs>	 (03PS1) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950)
[07:33:04] <wikibugs>	 (03PS2) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950)
[07:34:27] <wikibugs>	 (03PS3) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950)
[07:36:24] <wikibugs>	 (03PS4) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950)
[07:37:13] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] "approved by @aotto@wikimedia.org on phab task" [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060) (owner: 10Vgutierrez)
[07:38:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10170093 (10Vgutierrez) 05Stalled→03In progress
[07:39:26] <wikibugs>	 (03PS5) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950)
[07:41:45] <XioNoX>	 !log reboot cr3-ulsfo - T375345
[07:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:49] <stashbot>	 T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345
[07:42:44] <wikibugs>	 (03CR) 10Hashar: Check that throttling exceptions use valid public IP addresses (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[07:44:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:22] <icinga-wm>	 PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:44:46] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:46] <icinga-wm>	 PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:46:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10170101 (10ayounsi) From JTAC: > [...] after engaging further resources we have been requested to attempt a full chassis reboot and check if the issue persists before proceeding with the...
[07:47:10] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Dummy Gitlab tokens for IDM. [labs/private] - 10https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[07:47:18] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:46] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:46] <icinga-wm>	 RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:47:50] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 9, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:49:24] <icinga-wm>	 RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.60 ms
[07:49:28] <wikibugs>	 (03PS1) 10Slyngshede: C:idm Add gitlab configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820)
[07:50:19] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10170102 (10Vgutierrez) 05In progress→03Resolved ` vgutierrez@krb1001:~$ sudo manage_principals.py create cyndywikime --email_address=csimi...
[07:50:33] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4094/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[07:51:16] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[07:52:28] <wikibugs>	 (03PS2) 10Slyngshede: C:idm Add gitlab configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820)
[08:03:59] <wikibugs>	 (03CR) 10Muehlenhoff: "Also looks good, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[08:06:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10170125 (10dcaro)
[08:07:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10170130 (10dcaro)
[08:11:35] <wikibugs>	 (03Abandoned) 10Hashar: do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857 (owner: 10Jbond)
[08:14:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:37] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[08:25:29] <wikibugs>	 (03PS5) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[08:28:32] <wikibugs>	 (03PS6) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[08:29:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:29:45] <wikibugs>	 (03CR) 10Volans: deployment_server: mwscript_k8s usability tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075098 (owner: 10RLazarus)
[08:29:58] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Thanks Milimetric for the review. I'm gonna try to get a +1 from the Data Persistence team as well, then I will merge and apply." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[08:30:57] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643
[08:31:02] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[08:31:06] <jnuche>	 train prep failed last night, I'm re-running it
[08:34:23] <wikibugs>	 (03PS7) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[08:36:11] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223
[08:36:16] <stashbot>	 T375223: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[08:36:25] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223
[08:36:38] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetmaster2001.codfw.wmnet with reason: WIP - working on puppet runs
[08:36:52] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetmaster2001.codfw.wmnet with reason: WIP - working on puppet runs
[08:37:17] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[08:37:45] <wikibugs>	 (03CR) 10Slyngshede: Account blocking (0311 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[08:38:44] <wikibugs>	 (03CR) 10Slyngshede: "See comment regarding moving blocking logic to bitu-ldap. It feels like the correct move, but I'd like to do that in a second step." [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[08:41:10] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:41:47] <jinxer-wm>	 FIRING: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:43:30] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643
[08:43:34] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[08:46:13] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[08:49:46] <wikibugs>	 (03CR) 10Muehlenhoff: Account blocking (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[08:51:35] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage
[08:53:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:55:41] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage
[08:56:20] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10170311 (10Jelto) I reviewed the [throttling in the past 7 days](https://grafana.wikimedia...
[08:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:56:42] <wikibugs>	 (03PS2) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788)
[08:57:32] <wikibugs>	 (03PS4) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788)
[08:59:51] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage
[09:03:10] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage
[09:03:34] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2426.codfw.wmnet
[09:04:12] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2426.codfw.wmnet
[09:04:36] <icinga-wm>	 PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:04:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2427.codfw.wmnet
[09:05:00] <wikibugs>	 (03PS2) 10Hnowlan: mediawiki: remove check_mw_versions [puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860)
[09:05:19] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2427.codfw.wmnet
[09:05:47] <wikibugs>	 (03PS5) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788)
[09:07:33] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2426.codfw.wmnet with reason: reimage
[09:07:46] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2426.codfw.wmnet with reason: reimage
[09:07:53] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2427.codfw.wmnet with reason: reimage
[09:08:06] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2427.codfw.wmnet with reason: reimage
[09:09:22] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: debmonitor could provide users with cumin and/or debdeploy pre-made config/command - https://phabricator.wikimedia.org/T375475 (10fgiunchedi) 03NEW
[09:10:03] <wikibugs>	 (03CR) 10Btullis: [C:03+1] hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene)
[09:10:28] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "Good catch, removed!" [puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860) (owner: 10Hnowlan)
[09:11:26] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1177.eqiad.wmnet with OS bullseye
[09:11:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-web/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:14:24] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:26] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:47] <effie>	 jnuche: tell me what your scap status is
[09:15:33] <jnuche>	 effie: still running, I'll give an update once it's finished
[09:15:45] <effie>	 although I depooled, cordoned, and marked both hosts as unschedulable, I may still have stepped on your toes
[09:17:04] <jnuche>	 the deployment is rolling back, but this is a problem that already happened last night
[09:17:24] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:17:24] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:17:25] <jnuche>	 the deployments to K8s have become much slower and are timing out
[09:17:40] <effie>	 I see, is there a task? we should probably look into it 
[09:18:17] <jnuche>	 I haven't created a task yet, I think the issue may be related to https://phabricator.wikimedia.org/T366778
[09:18:56] <wikibugs>	 (03CR) 10Ayounsi: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi)
[09:19:11] <wikibugs>	 (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[09:19:13] <wikibugs>	 (03PS6) 10Ayounsi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435
[09:19:20] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi)
[09:19:37] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1176.eqiad.wmnet with OS bullseye
[09:20:10] <wikibugs>	 (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[09:21:32] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10170362 (10dcaro) Okok, let's take the 8 drives from cloudcephosd1025 on rack E4 to send them, let me drain it firs...
[09:21:46] <logmsgbot>	 !log jnuche@deploy1003 scap failed: <UnboundLocalError> local variable 'e' referenced before assignment (scap version: 4.104.0-1) (duration: 38m 15s)
[09:21:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-web/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:22:10] <jnuche>	 effie: rollback completed, I'm creating the task
[09:22:13] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good. One final inline. One other thing we should consider in a separate followup is notifications: We should probably send a notific" [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[09:23:22] <wikibugs>	 (03PS2) 10Slyngshede: Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820)
[09:24:22] <wikibugs>	 (03PS8) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[09:25:09] <wikibugs>	 (03CR) 10Slyngshede: Account blocking (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[09:25:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw
[09:25:52] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=93) for the switch from eqiad to codfw
[09:26:50] <wikibugs>	 (03PS7) 10Filippo Giunchedi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi)
[09:27:40] <volans>	 arnaudb: what was this run of prepare?
[09:27:44] <tappof>	 !log upgrade mtail on lists* and ncredir* https://phabricator.wikimedia.org/T375085
[09:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:52] <arnaudb>	 yes but on patch sorry
[09:28:07] <arnaudb>	 I'm testing w/ jynus on a tmux 1073762
[09:28:48] <jnuche>	 effie: task https://phabricator.wikimedia.org/T375477
[09:28:55] <arnaudb>	 --no-sal-logging in use from now
[09:28:56] <arnaudb>	 sorry
[09:29:01] <volans>	 ah ok, no worries
[09:29:21] <effie>	 jnuche: tx 
[09:29:26] <volans>	 just checking, I was slightly worried, but also the cookbook should have enough checks at the start to prevent runs if it was already done
[09:29:48] <wikibugs>	 (03PS3) 10Andrew Bogott: Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814)
[09:31:50] <wikibugs>	 (03CR) 10David Caro: [C:03+2] Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: 10Andrew Bogott)
[09:32:07] <wikibugs>	 (03PS8) 10Filippo Giunchedi: Add monitoring to network devices gRPC endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi)
[09:33:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:34:12] <wikibugs>	 (03PS1) 10Cyndywikime: Drop support for the Old Impact Variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077)
[09:35:46] <wikibugs>	 (03PS2) 10Cyndywikime: Drop support for the old impact variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077)
[09:36:26] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:36:26] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:39:13] <wikibugs>	 (03CR) 10JMeybohm: Initial commit of containerd puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[09:39:17] <wikibugs>	 (03PS1) 10Effie Mouzeli: kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878)
[09:40:26] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:40:26] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:42:10] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli)
[09:42:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli)
[09:43:12] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: rename mw2426 -> wikikube-worker2126 [puppet] - 10https://gerrit.wikimedia.org/r/1075149 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli)
[09:46:33] <wikibugs>	 (03PS9) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[09:46:38] <wikibugs>	 (03PS1) 10Btullis: partman: Enable use of the second disk for dse-k8s local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075151 (https://phabricator.wikimedia.org/T365283)
[09:46:50] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[09:47:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] partman: Enable use of the second disk for dse-k8s local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075151 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis)
[09:48:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete rendering certs [puppet] - 10https://gerrit.wikimedia.org/r/1075152 (https://phabricator.wikimedia.org/T357750)
[09:48:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: git: add replicated_local_repo define (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[09:49:01] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723)
[09:49:01] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723)
[09:49:01] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040
[09:49:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: service: make legacy function work with puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075153
[09:51:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[09:51:56] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2426.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:52:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto)
[09:52:40] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2426 to wikikube-worker2126
[09:52:51] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[09:54:09] <_Gerges>	 Ping
[09:54:30] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:54:30] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:55:47] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[09:56:19] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2426 to wikikube-worker2126 - jiji@cumin1002"
[09:56:48] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10170568 (10Jelto)
[09:56:54] <_Gerges>	 Hi, can someone take on both tasks T375055 and T375054? I'll be busy this weekend.
[09:56:55] <stashbot>	 T375055: Requesting logo change for bjn.wikipedia.org - https://phabricator.wikimedia.org/T375055
[09:56:55] <stashbot>	 T375054: Requesting logo change for bjn.wikiquote.org - https://phabricator.wikimedia.org/T375054
[09:57:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2426 to wikikube-worker2126 - jiji@cumin1002"
[09:57:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:57:15] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2126
[09:57:47] <wikibugs>	 (PS4) Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723)
[09:57:47] <wikibugs>	 (PS4) Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723)
[09:57:47] <wikibugs>	 (PS4) Giuseppe Lavagetto: profile: add conftool2git [puppet] - https://gerrit.wikimedia.org/r/1075040
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1000)
[10:00:18] <wikibugs>	 (PS10) Slyngshede: Account blocking [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[10:00:44] <wikibugs>	 (CR) CI reject: [V:-1] git: add replicated_local_repo define [puppet] - https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[10:00:56] <wikibugs>	 (CR) CI reject: [V:-1] profile: add conftool2git [puppet] - https://gerrit.wikimedia.org/r/1075040 (owner: Giuseppe Lavagetto)
[10:01:28] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2126
[10:02:06] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2426 to wikikube-worker2126
[10:03:28] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2126.codfw.wmnet on all recursors
[10:03:31] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2126.codfw.wmnet on all recursors
[10:03:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:21] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2126.codfw.wmnet
[10:05:42] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2126.codfw.wmnet with OS bullseye
[10:05:52] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2126
[10:06:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[10:09:23] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[10:09:39] <wikibugs>	 SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170593 (MatthewVernon) I've confirmed that neither production swift cluster contains the object. And as far back as the swift logs go, we've only ever said 404 to requests for...
[10:10:07] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643
[10:10:12] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[10:10:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:11:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2126 - jiji@cumin1002"
[10:11:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2126 - jiji@cumin1002"
[10:11:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:11:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2126.codfw.wmnet 82.0.192.10.in-addr.arpa 2.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:11:30] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2126.codfw.wmnet 82.0.192.10.in-addr.arpa 2.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:11:31] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2126
[10:11:53] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2126
[10:11:53] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2126
[10:13:20] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[10:13:44] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet
[10:14:31] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1176.eqiad.wmnet
[10:14:59] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet
[10:16:32] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1176.eqiad.wmnet
[10:16:46] <wikibugs>	 (CR) Gmodena: [C:+1] Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: Xcollazo)
[10:17:08] <wikibugs>	 (PS5) Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723)
[10:17:23] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet
[10:19:48] <wikibugs>	 (PS1) Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878)
[10:21:07] <wikibugs>	 (PS2) Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878)
[10:22:59] <wikibugs>	 (PS3) Effie Mouzeli: kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878)
[10:25:30] <godog>	 !log force deletion of older thanos blocks - T351927
[10:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:34] <stashbot>	 T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[10:25:53] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[10:25:53] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[10:26:56] <wikibugs>	 (CR) Btullis: [C:+1] "This change gets a +1 from me, but I'm also adding hnowlan for review." [puppet] - https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[10:28:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:12] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage
[10:30:23] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:26] <wikibugs>	 (CR) Hnowlan: [C:+1] "Thanks for the heads-up!" [puppet] - https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[10:32:47] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[10:34:03] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage
[10:36:41] <wikibugs>	 (CR) Btullis: [C:+2] hieradata::services_proxy::envoy.yaml: fix duplicated port [puppet] - https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[10:40:36] <wikibugs>	 (CR) Gmodena: [C:+2] dse-k8s-service: add values for dumps2 job. [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[10:41:34] <wikibugs>	 (Merged) jenkins-bot: dse-k8s-service: add values for dumps2 job. [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[10:43:09] <wikibugs>	 (PS1) Hnowlan: aqs: remove AQSv1 service components [puppet] - https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143)
[10:44:12] <wikibugs>	 (CR) Hnowlan: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4095/co" [puppet] - https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: Hnowlan)
[10:45:44] <wikibugs>	 (CR) Hnowlan: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1075153 (owner: Giuseppe Lavagetto)
[10:47:38] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): SSO domain shouldn't have a mobile version (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: Bartosz Dziewoński)
[10:54:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:57:18] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2126.codfw.wmnet with OS bullseye
[10:58:52] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:58:57] <wikibugs>	 (CR) Muehlenhoff: Block User: Add LDAP blocking/unblocking. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:10:56] <wikibugs>	 SRE-swift-storage, media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170718 (jcrespo) Following with the usual preference, I would like to do a new upload rather than overwriting (for the end user it should have the same effe...
[11:12:45] <wikibugs>	 (CR) FNegri: [C:+1] [WikiReplicas] Hide autoblock targets in the globalblocks table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[11:13:29] <wikibugs>	 SRE-swift-storage, media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170728 (jcrespo) p:Triage→High
[11:13:35] <wikibugs>	 (PS1) Slyngshede: Dummy secrets for IDM account blocking. [labs/private] - https://gerrit.wikimedia.org/r/1075174
[11:14:07] <wikibugs>	 SRE-swift-storage, media-backups: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10170726 (jcrespo) Open→In progress a:prabhat
[11:16:13] <wikibugs>	 (CR) Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[11:17:32] <wikibugs>	 (PS2) Hnowlan: aqs: remove AQSv1 service components [puppet] - https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143)
[11:19:34] <wikibugs>	 (CR) Hnowlan: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4096/co" [puppet] - https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: Hnowlan)
[11:20:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:21:33] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10170740 (MoritzMuehlenhoff)
[11:25:13] <wikibugs>	 (PS1) Btullis: partman: Fix allocation of sdb for dse-k8s workers [puppet] - https://gerrit.wikimedia.org/r/1075181 (https://phabricator.wikimedia.org/T365283)
[11:25:56] <wikibugs>	 (CR) Btullis: [C:+2] partman: Fix allocation of sdb for dse-k8s workers [puppet] - https://gerrit.wikimedia.org/r/1075181 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[11:27:05] <wikibugs>	 (PS1) Bartosz Dziewoński: Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012)
[11:27:16] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: Bartosz Dziewoński)
[11:27:27] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Dummy secrets for IDM account blocking. [labs/private] - https://gerrit.wikimedia.org/r/1075174 (owner: Slyngshede)
[11:28:28] <icinga-wm>	 PROBLEM - Host mc2038 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:13] <vgutierrez>	 expected?
[11:30:33] <wikibugs>	 (PS1) Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399)
[11:30:45] <wikibugs>	 (PS2) Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399)
[11:30:46] <jynus>	 doesn't respond on ipv4 or ipv6
[11:31:20] <effie>	 !log homer cr*codfw* commit 'T372878'
[11:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:24] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[11:31:30] <effie>	 !log homer lsw1-a6-codfw* commit 'T372878'
[11:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:50] <effie>	 vgutierrez: I will take a look in a bit, but the host being down is not a problem
[11:31:53] <moritzm>	 there's a CPU error logged on mc2038
[11:32:03] <moritzm>	 CPU 1 machine check error detected
[11:32:06] <moritzm>	 in SEL
[11:32:09] <effie>	 excellent
[11:32:37] <moritzm>	 under warranty for two more months fortunately
[11:33:08] <vgutierrez>	 just in time(TM)
[11:33:15] <effie>	 moritzm: can you please paste whatever you have in a phab paste? I will open a task 
[11:34:49] <moritzm>	 effie: sure: https://phabricator.wikimedia.org/P69399
[11:35:10] <effie>	 cheers tx
[11:36:09] <moritzm>	 I tried to have a look at the system itself, but it seems the console died along with it, can't connect
[11:36:10] <wikibugs>	 (PS1) Cyndywikime: Remove wgGEUseNewImpactModule config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1075196 (https://phabricator.wikimedia.org/T350077)
[11:36:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: Muehlenhoff)
[11:38:27] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[11:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:48] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2126.codfw.wmnet
[11:38:50] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2126.codfw.wmnet
[11:38:51] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2126.codfw.wmnet
[11:39:04] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply
[11:39:07] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply
[11:40:52] <wikibugs>	 (PS2) Abijeet Patro: Enable translation settings banner for Test wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1075189 (https://phabricator.wikimedia.org/T372460)
[11:40:56] <wikibugs>	 (PS1) JMeybohm: prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488)
[11:42:36] <wikibugs>	 (PS3) Slyngshede: C:idm Add configuration for account blocking. [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820)
[11:42:46] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[11:44:32] <wikibugs>	 (PS4) Slyngshede: C:idm Add configuration for account blocking. [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820)
[11:44:45] <wikibugs>	 (CR) Muehlenhoff: C:idm Add configuration for account blocking. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:44:54] <wikibugs>	 (CR) D3r1ck01: [C:+1] Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: Bartosz Dziewoński)
[11:45:15] <wikibugs>	 (PS2) Btullis: Add an airflow cluster and assign relevant hosts [puppet] - https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932)
[11:45:15] <wikibugs>	 (PS2) Btullis: Add a presto cluster and assign the relevant hosts [puppet] - https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932)
[11:45:20] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4098/co" [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:45:23] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[11:45:26] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[11:45:34] <wikibugs>	 (PS2) JMeybohm: prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488)
[11:46:53] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4099/co" [puppet] - https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[11:46:58] <wikibugs>	 (PS5) Slyngshede: C:idm Add configuration for account blocking. [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820)
[11:47:09] <wikibugs>	 (CR) Btullis: [C:+2] Add a presto cluster and assign the relevant hosts [puppet] - https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) (owner: Btullis)
[11:47:27] <wikibugs>	 (CR) Bartosz Dziewoński: SSO domain shouldn't have a mobile version (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: Bartosz Dziewoński)
[11:47:31] <wikibugs>	 (CR) Btullis: [C:+2] Add an airflow cluster and assign relevant hosts [puppet] - https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) (owner: Btullis)
[11:47:46] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4100/co" [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:48:20] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[11:48:52] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:48:57] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[11:52:52] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:54:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:41] <wikibugs>	 (CR) Slyngshede: [V:+1] C:idm Add configuration for account blocking. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:57:22] <wikibugs>	 (PS5) Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272)
[11:58:19] <wikibugs>	 (CR) Ladsgroup: "I honestly prefer a way to set a section and run them manually so I have control over which sections should be done and in which order but" [software/spicerack] - https://gerrit.wikimedia.org/r/1074111 (owner: Volans)
[11:58:20] <wikibugs>	 (CR) Slyngshede: [C:+2] Account blocking [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[11:58:54] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:59:31] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: reimage
[11:59:33] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: reimage
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1200)
[12:00:56] <wikibugs>	 (Merged) jenkins-bot: Account blocking [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[12:01:34] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[12:05:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2427.codfw.wmnet
[12:05:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2427.codfw.wmnet
[12:05:27] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[12:05:44] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kubernetes: rename mw2427 -> wikikube-worker2127 [puppet] - https://gerrit.wikimedia.org/r/1075158 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[12:07:48] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:07:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:08:34] <wikibugs>	 (PS1) Santiago Faci: MPIC: Deploying on staging a new release v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473)
[12:08:49] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:08:50] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:09:42] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:09:46] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:12:28] <jynus>	 !log running db-compare on s2, s3 T375186
[12:12:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:33] <stashbot>	 T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186
[12:12:48] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:13:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[12:13:19] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:13:26] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:13:41] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, serviceops, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10170935 (JMeybohm)
[12:14:36] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:14:42] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:15:05] <wikibugs>	 (PS1) Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997)
[12:15:48] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:15:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:16:11] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:16:17] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:16:18] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2427.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:17:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[12:17:28] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2427 to wikikube-worker2127
[12:17:44] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2427 to wikikube-worker2127
[12:19:49] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 289, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:19:50] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 371, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:20:13] <wikibugs>	 (CR) Bartosz Dziewoński: "This is just an idea, I'm not sure about it, but I hope it's a good one. Thoughts?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: Bartosz Dziewoński)
[12:21:03] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1074435 (owner: Ayounsi)
[12:21:14] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2427 to wikikube-worker2127
[12:21:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[12:22:12] <wikibugs>	 (Abandoned) Muehlenhoff: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[12:24:44] <wikibugs>	 (CR) Ayounsi: [C:+2] Add monitoring to network devices gRPC endpoints [puppet] - https://gerrit.wikimedia.org/r/1074435 (owner: Ayounsi)
[12:24:53] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[12:25:06] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update Tegola's Docker image to pick up package upgrades [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey)
[12:25:46] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on mc2038.codfw.wmnet with reason: CPU failure - T375495
[12:25:50] <stashbot>	 T375495: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495
[12:26:00] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on mc2038.codfw.wmnet with reason: CPU failure - T375495
[12:27:01] <wikibugs>	 10ops-codfw, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495 (10jijiki) 03NEW
[12:27:38] <wikibugs>	 10ops-codfw, 06DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10171003 (10jijiki)
[12:28:05] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:28:10] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:28:25] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2427 to wikikube-worker2127 - jiji@cumin1002"
[12:29:10] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2427 to wikikube-worker2127 - jiji@cumin1002"
[12:29:10] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:29:11] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2127
[12:29:13] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997)
[12:29:17] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[12:29:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2127
[12:29:36] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[12:30:02] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997)
[12:30:04] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2427 to wikikube-worker2127
[12:30:28] <wikibugs>	 (03PS1) 10Slyngshede: Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218
[12:30:46] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2127.codfw.wmnet on all recursors
[12:30:49] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[12:30:49] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2127.codfw.wmnet on all recursors
[12:30:50] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm as far as I can tell. After merging this the previous fix for gitlab-runners I06f578e23689c29be78eb888f1a8bbbf60b249f9 can be reverte" [puppet] - 10https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: 10JMeybohm)
[12:30:57] <wikibugs>	 (03PS3) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399)
[12:31:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff)
[12:31:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[12:31:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede)
[12:32:06] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync
[12:32:08] <wikibugs>	 (03Abandoned) 10Slyngshede: Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[12:32:26] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2127.codfw.wmnet
[12:32:40] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync
[12:32:48] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2127.codfw.wmnet with OS bullseye
[12:32:58] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2127
[12:33:06] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[12:33:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10171032 (10jijiki) a:03jijiki
[12:35:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10171060 (10MoritzMuehlenhoff)
[12:35:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10171075 (10MoritzMuehlenhoff)
[12:35:58] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service cr1-esams.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:36:17] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2127 - jiji@cumin1002"
[12:36:21] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2127 - jiji@cumin1002"
[12:36:21] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:36:21] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2127.codfw.wmnet 83.0.192.10.in-addr.arpa 3.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:36:24] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2127.codfw.wmnet 83.0.192.10.in-addr.arpa 3.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:36:25] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2127
[12:36:47] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2127
[12:36:47] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2127
[12:37:47] <wikibugs>	 (03CR) 10Volans: "This is just the official list of all core sections that can be used in cookbooks as they wish. For example it can be used as an argument " [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans)
[12:38:13] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job probes/grpc in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:38:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] SSO domain shouldn't have a mobile version (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński)
[12:39:54] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:39:54] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:40:29] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:40:58] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service cr1-eqiad.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:43:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[12:43:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[12:43:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[12:44:57] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:48:04] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:48:18] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723)
[12:48:18] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040
[12:48:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723)
[12:48:44] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:53:13] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/grpc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477)
[12:55:18] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage
[12:55:58] <jinxer-wm>	 FIRING: [10x] CertAlmostExpired: Certificate for service cr1-codfw.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:56:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477) (owner: 10Alexandros Kosiaris)
[12:58:04] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French)
[12:58:20] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage
[12:59:33] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede)
[12:59:51] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] C:idm Add configuration for account blocking. [puppet] - 10https://gerrit.wikimedia.org/r/1075141 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1300).
[13:00:05] <jouncebot>	 hnowlan and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <Lucas_WMDE>	 o/
[13:00:26] <MatmaRex>	 hi
[13:00:28] <TheresNoTime>	 all yours Lucas :D
[13:00:36] <Lucas_WMDE>	 ok :3
[13:00:40] <hnowlan>	 o/
[13:01:00] <hnowlan>	 Mine is yet another only-takes-effect-on-jobrunners change that can't be tested on debug
[13:01:22] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Apply videoscaler request limits and wall clock time limits to shellbox-video (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:01:26] <Lucas_WMDE>	 ack
[13:01:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:01:45] <Lucas_WMDE>	 right, let’s try this merging without explicit rebase thing
[13:01:50] <Lucas_WMDE>	 since gerrit should now autorebase
[13:01:58] * Lucas_WMDE sees a lot of unnormalized errors in logspam-watch :/
[13:02:28] <wikibugs>	 (03Merged) 10jenkins-bot: Block User: Add LDAP blocking/unblocking. [software/bitu] - 10https://gerrit.wikimedia.org/r/1075218 (owner: 10Slyngshede)
[13:02:32] * Lucas_WMDE looks how long CentralAuth CI usually takes
[13:02:34] <wikibugs>	 (03Merged) 10jenkins-bot: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:02:36] <Lucas_WMDE>	 ten minutes, ok
[13:02:42] <Lucas_WMDE>	 then let’s not +2 that backport just yet I think
[13:02:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]]
[13:03:02] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[13:03:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
[13:03:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops
[13:03:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[13:03:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[13:04:31] <wikibugs>	 (03PS1) 10Gmodena: dse-k8s-services: fix values in dump enrichment app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787)
[13:04:56] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:01] <wikibugs>	 (03PS1) 10Slyngshede: P:idm Syntax error in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1075227
[13:05:46] <MatmaRex>	 Lucas_WMDE: they say that CI has just gotten much faster, btw: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/XQZNOGXOJP62NSNHG24HIMOYWP5CG737/
[13:06:13] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idm Syntax error in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1075227 (owner: 10Slyngshede)
[13:06:25] <Lucas_WMDE>	 ah, true
[13:06:36] <Lucas_WMDE>	 makes sense that selenium was the slowest of the jobs on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1075045 then
[13:06:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] canaries: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075221 (https://phabricator.wikimedia.org/T375477) (owner: 10Alexandros Kosiaris)
[13:06:48] <Lucas_WMDE>	 (I don’t think that one is parallelized everywhere yet)
[13:07:00] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:08:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:24] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[13:08:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] envoy: Add support for passing an array of sets to the firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff)
[13:08:45] <ottomata>	 Lucas_WMDE: would you mind if we added a simple config change to the end of the window?  
[13:09:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Continuing with sync
[13:09:29] <Lucas_WMDE>	 ottomata: sure, go ahead
[13:10:23] <wikibugs>	 10SRE-swift-storage, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#10171322 (10fgiunchedi)
[13:10:29] <ottomata>	 Actually, Lucas_WMDE let me know when you are done with the window and I will deploy it. (easier than editing the calendar :p )
[13:11:26] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm
[13:11:29] <Lucas_WMDE>	 ottomata: ok :P
[13:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:12:00] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:12:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] "The underlying patch is now merged, but still needs to be updated to use firewall_src_sets" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[13:12:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] "The underlying patch is now merged, but still needs to be updated to use firewall_src_sets" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[13:12:40] <akosiaris>	 Lucas_WMDE: did by any chance canary "feel faster"? 
[13:13:02] <Lucas_WMDE>	 no idea, I wasn’t looking very closely at scap tbh
[13:13:07] <Lucas_WMDE>	 I can scroll up and see how long it took
[13:13:32] <Lucas_WMDE>	 sync-canaries-k8s was apparently 2m36s
[13:13:52] <akosiaris>	 that's faster alright
[13:14:10] <akosiaris>	 it was like 4+ previously
[13:14:17] <Lucas_WMDE>	 nice!
[13:14:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10171352 (10Papaul) @ABran-WMF you're welcome.
[13:14:31] <jnuche>	 \o/
[13:15:11] <wikibugs>	 (03PS1) 10Slyngshede: P:idm add account managers for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1075230
[13:16:29] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] SSO domain shouldn't have a mobile version (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński)
[13:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:18:28] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2127.codfw.wmnet with OS bullseye
[13:20:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] (duration: 17m 12s)
[13:20:11] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[13:20:27] <hnowlan>	 Thanks Lucas_WMDE! 
[13:20:33] <Lucas_WMDE>	 np!
[13:20:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński)
[13:20:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] P:idm add account managers for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1075230 (owner: 10Slyngshede)
[13:21:09] <effie>	 !log homer lsw1-a6-codfw* commit 'T372878
[13:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:13] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:21:18] <wikibugs>	 (03Merged) 10jenkins-bot: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński)
[13:21:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]]
[13:21:43] <stashbot>	 T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272
[13:23:14] <icinga-wm>	 RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[13:23:51] <kamila_>	 !log homer cr*codfw* commit T372878
[13:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński)
[13:25:17] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[13:25:55] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm
[13:27:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:27:07] <stashbot>	 T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272
[13:27:38] <MatmaRex>	 Lucas_WMDE: that config change is currently only testable on the beta cluster
[13:27:43] <Lucas_WMDE>	 was just about to ask, yeah
[13:27:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync
[13:27:46] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2127.codfw.wmnet
[13:27:48] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2127.codfw.wmnet
[13:27:48] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2127.codfw.wmnet
[13:30:13] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 287, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:04] <Lucas_WMDE>	 (2m45s for sync-canaries-k8s this time btw)
[13:31:34] <wikibugs>	 (03CR) 10David Caro: "I think this broke puppet runs on toolforge prometheus :/" [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi)
[13:32:44] <wikibugs>	 (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[13:33:37] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Roll out nftables on both list hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073189
[13:34:08] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm
[13:34:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074418|SSO domain shouldn't have a mobile version (T375272)]] (duration: 12m 51s)
[13:34:35] <stashbot>	 T375272: Beta cluster SSO domain has a mobile version, but shouldn't - https://phabricator.wikimedia.org/T375272
[13:35:02] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily allow core password reset functionality [extensions/CentralAuth] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075182 (https://phabricator.wikimedia.org/T151012) (owner: 10Bartosz Dziewoński)
[13:35:17] <wikibugs>	 (03PS2) 10Cyndywikime: Remove wgGEUseNewImpactModule config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075196 (https://phabricator.wikimedia.org/T350077)
[13:35:41] <wikibugs>	 (03CR) 10Gmodena: [C:03+2] Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[13:35:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]]
[13:35:59] <stashbot>	 T151012: CentralAuth should have its own temporary password handling - https://phabricator.wikimedia.org/T151012
[13:36:16] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073189 (owner: 10EoghanGaffney)
[13:36:26] <wikibugs>	 (03Merged) 10jenkins-bot: Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[13:37:19] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 369, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:37:41] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345]
[13:37:45] <stashbot>	 T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345
[13:37:49] <wikibugs>	 (03PS1) 10Ayounsi: gNMI prometheus check: add specific network CA cert [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513)
[13:37:53] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345]
[13:38:18] <Lucas_WMDE>	 hm, one testserver check failed
[13:38:31] <Lucas_WMDE>	 https://zero.wikipedia.org/ – expected 301, got 503
[13:38:39] <Lucas_WMDE>	 and https://login.wikimedia.org/wiki/Special:Log/renameuser – expected 200, got 503
[13:38:44] <Lucas_WMDE>	 (*two testserver checks)
[13:38:58] <Lucas_WMDE>	 let’s see if I can find those in logstash…
[13:39:35] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) (owner: 10Ayounsi)
[13:40:05] <Lucas_WMDE>	 grmbl, can’t find anything in logstash
[13:40:32] <Lucas_WMDE>	 I guess I’ll retry…
[13:40:38] <Lucas_WMDE>	 https://login.wikimedia.org/wiki/Special:Log/renameuser works for me on mwdebug2001 at least
[13:41:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:41:09] <MatmaRex>	 Lucas_WMDE: the CentralAuth change also isn't really testable on mwdebug - unless you requested a password reset on a wiki currently running wmf.24 before wmf.24 went out
[13:41:11] <stashbot>	 T151012: CentralAuth should have its own temporary password handling - https://phabricator.wikimedia.org/T151012
[13:41:18] <Lucas_WMDE>	 ah, I see
[13:41:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync
[13:41:38] <Lucas_WMDE>	 let’s try our luck then
[13:41:50] <Lucas_WMDE>	 (retrying the testserver checks worked apparently btw)
[13:41:56] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm add account managers for testing. [puppet] - https://gerrit.wikimedia.org/r/1075230 (owner: Slyngshede)
[13:42:11] <Lucas_WMDE>	 o_O
[13:42:20] <moritzm>	 !log installing krb5 security updates
[13:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:34] <Lucas_WMDE>	 goddammit now I’ve got a cached redirect from wikitech to foundationwiki
[13:42:42] <Lucas_WMDE>	 which doesn’t go away even if I turn off WikimediaDebug again
[13:43:01] <Lucas_WMDE>	 (and I’m lucky that I even know that it’s related to WikimediaDebug at all because I happen to have heard of it somewhere)
[13:43:33] <Lucas_WMDE>	 yay, disabling cache in the network panel in dev tools fixed it…
[13:44:02] <wikibugs>	 (CR) David Caro: "Hmm, is there a new blackbox-exporter version needed for the grpc config? I think it's not understanding it, line 19 of the generated conf" [puppet] - https://gerrit.wikimedia.org/r/1074435 (owner: Ayounsi)
[13:46:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075182|Temporarily allow core password reset functionality (T151012)]] (duration: 10m 21s)
[13:46:20] <stashbot>	 T151012: CentralAuth should have its own temporary password handling - https://phabricator.wikimedia.org/T151012
[13:46:28] <Lucas_WMDE>	 (0m34s for sync-canaries-k8s this time! :o)
[13:46:33] <Lucas_WMDE>	 (cc akosiaris)
[13:46:36] <Lucas_WMDE>	 ottomata: all yours
[13:47:07] <denisse>	 !incidents 
[13:47:08] <sirenbot>	 5274 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[13:47:08] <sirenbot>	 5273 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[13:47:08] <sirenbot>	 5272 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[13:47:08] <sirenbot>	 5271 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[13:47:08] <sirenbot>	 5267 (RESOLVED)  Host pc1013 (paged) - PING  - Packet loss = 100%
[13:47:43] <akosiaris>	 Lucas_WMDE: woohoo!
[13:47:51] <ottomata>	 Lucas_WMDE: ty
[13:48:00] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[13:48:02] <wikibugs>	 (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[13:49:31] <wikibugs>	 (CR) EoghanGaffney: [V:+1 C:+2] lists: Roll out nftables on both list hosts [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[13:50:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: Bartosz Dziewoński)
[13:50:46] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[13:51:07] <wikibugs>	 (CR) Ottomata: [C:+2] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: Joal)
[13:51:09] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[13:51:26] <wikibugs>	 (PS4) Joal: EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498)
[13:51:36] <wikibugs>	 (CR) Ottomata: [V:+2 C:+2] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: Joal)
[13:51:49] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM from a Prometheus perspective" [puppet] - https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) (owner: Ayounsi)
[13:53:13] <godog>	 dcaro: ah yes pretty sure it is the blackbox-exporter version mismatch
[13:53:30] <godog>	 dcaro: prometheus runs on bookworm in production FWIW, I did an in-place upgrade a couple of weeks back and all was good
[13:53:52] <wikibugs>	 (PS1) Ayounsi: Allow prometheus hosts to reach gnmi port [homer/public] - https://gerrit.wikimedia.org/r/1075237
[13:53:57] <logmsgbot>	 !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]]
[13:54:01] <stashbot>	 T361498: [Refine Refactoring] Detect inactive event streams / Refine datasets using data recency thresholds - https://phabricator.wikimedia.org/T361498
[13:54:02] <dcaro>	 we still have bullseye on toolforge
[13:54:19] <dcaro>	 is there a version of the exporter available for bullseye somewhere?
[13:54:35] <godog>	 checking
[13:54:41] <dcaro>	 (should not be hard to upgrade the OS, but will need some time)
[13:55:53] <logmsgbot>	 !log otto@deploy1003 otto, joal: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:56:31] <wikibugs>	 (PS1) Joal: Remove labswiki from HDFS imported dumps [puppet] - https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792)
[13:57:17] <godog>	 dcaro: can't find a backported version for bullseye, no :( it is a hack though copying just the binary should work
[13:59:26] <dcaro>	 hmmm, hacky, would it be too hard to package the binary only?
[13:59:36] <logmsgbot>	 !log otto@deploy1003 otto, joal: Continuing with sync
[13:59:38] <MatmaRex>	 Lucas_WMDE: thanks for deploying!
[13:59:49] <Lucas_WMDE>	 np!
[14:00:08] <wikibugs>	 (CR) Milimetric: [C:+1] Remove labswiki from HDFS imported dumps [puppet] - https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: Joal)
[14:02:48] <ottomata>	 akosiaris: 38 seconds for canaries for me!
[14:03:12] <dcaro>	 godog: yep, using the 0.23 binary works ok
[14:03:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:56] <godog>	 dcaro: yeah, technically it is "just" the libc6 versioned dependency that makes the package uninstallable on bullseye
[14:04:01] <logmsgbot>	 !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074102|EventStreamConfig: Disable regex steam hadoop ingestion (T361498)]] (duration: 10m 04s)
[14:04:06] <godog>	 but that's a lie in practice
[14:04:15] <stashbot>	 T361498: [Refine Refactoring] Detect inactive event streams / Refine datasets using data recency thresholds - https://phabricator.wikimedia.org/T361498
[14:04:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1044.eqiad.wmnet with OS bookworm
[14:06:32] <jnuche>	 ottomata: heya, are you finished with your backports?
[14:07:18] <wikibugs>	 (PS1) Majavah: P:toolforge: docker: registry: Add option to redirect domain root [puppet] - https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515)
[14:08:15] <wikibugs>	 (CR) JMeybohm: [V:+1 C:+2] prometheus::node_exporter: Don't exclude /var/lib/(docker|kubelet) [puppet] - https://gerrit.wikimedia.org/r/1075201 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[14:08:23] <icinga-wm>	 PROBLEM - Host lists1004 is DOWN: CRITICAL - Host Unreachable (208.80.154.81)
[14:08:25] <swfrench-wmf>	 jnuche: ottomata: where do things stand on deployments at the moment? I have one patch I need to merge/apply to scale things up ahead of the traffic switchover happening at 15:00
[14:08:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm
[14:08:41] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4102/co" [puppet] - https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: Majavah)
[14:08:49] <moritzm>	 libc6 uses versioned symbols, if it has the dependency on a specific libc6, that typically means that it uses some feature which isn't available in the older libc (usually openat* etc.)
[14:08:52] <swfrench-wmf>	 this should be fairly quick (helmfile apply to a small subset of mediawiki deployments)
[14:09:07] <jnuche>	 swfrench-wmf: I can hold off for you
[14:09:08] <moritzm>	 that usually means we'd actually need to rebuild on the older OS/glibc
[14:09:44] <dcaro>	 moritzm: so you would expect it to crash eventually right? (whenever it tries to use those missing symbols)
[14:09:55] <icinga-wm>	 RECOVERY - Host lists1004 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:10:01] <moritzm>	 or not even start at all, yes
[14:10:23] <swfrench-wmf>	 jnuche: that would be great - thank you! I'll merge my patch now and give you a heads up when I'm clear.
[14:10:23] <moritzm>	 but worth a shot, depends on what features the Go codebase uses
[14:11:22] <dcaro>	 it started and it seems to be running ok
[14:12:19] <godog>	 yeah I forget now how to check which actual symbols are implicated in the dependency, but anyways yes we're talking hacks 
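
One way to see which symbols are implicated in the libc6 dependency discussed above is to read the version tags on the binary's undefined dynamic symbols. The following is a minimal sketch, not the procedure anyone in the channel used: it assumes the text comes from `objdump -T <binary>` (binutils), and the helper names and the `SAMPLE` listing are hypothetical and only there for illustration.

```python
from __future__ import annotations

import re

# Matches glibc symbol-version tags such as GLIBC_2.34 or GLIBC_2.2.5.
_GLIBC_TAG = re.compile(r"GLIBC_(\d+(?:\.\d+)+)")


def glibc_requirements(objdump_output: str) -> dict[tuple[int, ...], list[str]]:
    """Group undefined dynamic symbols by the glibc version they are tagged with."""
    required: dict[tuple[int, ...], list[str]] = {}
    for line in objdump_output.splitlines():
        # Only undefined symbols are imports that the system libc must provide.
        if "*UND*" not in line:
            continue
        match = _GLIBC_TAG.search(line)
        if match is None:
            continue
        version = tuple(int(part) for part in match.group(1).split("."))
        symbol = line.split()[-1]  # the symbol name is the last column
        required.setdefault(version, []).append(symbol)
    return required


def newest_required(objdump_output: str) -> tuple[int, ...] | None:
    """Return the highest glibc version any imported symbol asks for, if any."""
    requirements = glibc_requirements(objdump_output)
    return max(requirements) if requirements else None


# Hypothetical excerpt in the shape of `objdump -T` output, for illustration only.
SAMPLE = """\
0000000000000000      DF *UND*  0000000000000000 (GLIBC_2.2.5) abort
0000000000000000      DF *UND*  0000000000000000 (GLIBC_2.34) pthread_create
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
"""
```

If the newest version reported is higher than the glibc on the target host (bullseye ships glibc 2.31), the dynamic loader would be expected to reject the binary at startup; if it is not higher, the copied binary has a reasonable chance of running.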
[14:13:05] <wikibugs>	 (CR) Btullis: "Thanks. I did initially set out to try to extend the file type, rather than create a custom type, but I ended up backtracking on it." [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[14:13:39] <wikibugs>	 (PS2) Scott French: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962)
[14:13:50] <wikibugs>	 (PS1) JMeybohm: Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488)
[14:14:01] <wikibugs>	 (CR) CI reject: [V:-1] Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[14:15:45] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: Scott French)
[14:15:54] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[14:16:49] <wikibugs>	 (Merged) jenkins-bot: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: Scott French)
[14:17:41] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: Majavah)
[14:17:55] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[14:18:10] <wikibugs>	 (PS2) JHathaway: puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[14:18:16] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[14:18:41] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:18:55] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[14:19:00] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:19:04] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge: docker: registry: Add option to redirect domain root [puppet] - https://gerrit.wikimedia.org/r/1075240 (https://phabricator.wikimedia.org/T375515) (owner: Majavah)
[14:19:23] <swfrench-wmf>	 !log scaled up mw-api-ext ahead of traffic+services switchover - T370962
[14:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:27] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[14:20:00] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:20:07] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1044.eqiad.wmnet with reason: host reimage
[14:20:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:21:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:21:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:21:47] <swfrench-wmf>	 !log scaled up mw-web ahead of traffic+services switchover - T370962
[14:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:10] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1044.eqiad.wmnet with reason: host reimage
[14:24:17] <logmsgbot>	 !log dcaro@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[14:26:07] <swfrench-wmf>	 jnuche: I should be out of your way now - thanks for your patience! be aware that since this makes the deployments larger, it _might_ also have the effect of slowing things down a bit (should not be large, though, given the amount of spare capacity we have in both k8s clusters).
[14:26:43] <wikibugs>	 (PS2) JMeybohm: Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488)
[14:27:07] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[14:27:08] <jnuche>	 swfrench-wmf: hmm, that's bad luck since we were struggling with timeouts, hopefully it will still be ok
[14:27:25] <jnuche>	 however it's probably too late now for the train prsync before we hit the infra window in 30m
[14:27:30] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387)
[14:27:31] <godog>	 dcaro: can I help with blackbox-exporter upgrade on bullseye ?
[14:27:33] <jnuche>	 I'll try again after that
[14:28:26] <jnuche>	 swfrench-wmf: will you let me know after the DC switchover is complete?
[14:28:54] <dcaro>	 godog: maybe, the process is quite manual though, create a new prometheus VM, move the volume and make sure everything works, you need to be a toolforge root though
[14:29:01] <dcaro>	 T375523
[14:29:02] <stashbot>	 T375523: [toolforge-prometheus] upgrade to bookworm - https://phabricator.wikimedia.org/T375523
[14:29:45] <godog>	 dcaro: is in-place upgrade to bookworm on the table for this specific problem/issue ?
[14:29:53] <swfrench-wmf>	 jnuche: yes, I'll post here when things are stable
[14:30:03] <jnuche>	 swfrench-wmf: thx!
[14:30:18] <dcaro>	 godog: nah, not worth it I think, I've manually deployed the new binaries, I'll keep an eye on them to see if they crash but it seems to be working ok
[14:30:39] <godog>	 dcaro: ok SGTM, good enough of a bandaid
[14:30:42] <dcaro>	 the thing is that VM-wise, in-place upgrades make tracking the OS of the VM complicated and such
[14:31:01] <godog>	 sorry about the blindside there, didn't realize grpc is bookworm-only
[14:31:20] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[14:32:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[14:33:58] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[14:37:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:38:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:32] <icinga-wm>	 RECOVERY - Host mc2038 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms
[14:47:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:47:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1044.eqiad.wmnet with OS bookworm
[14:50:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[14:53:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm
[14:53:57] <XioNoX>	 dcaro: thanks!
[14:56:13] <wikibugs>	 (CR) Ottomata: "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[15:00:05] <jouncebot>	 swfrench-wmf: May I have your attention please! Southward Datacenter Switchover: Services + Traffic. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500)
[15:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500).
[15:00:43] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:11] <swfrench-wmf>	 !log starting switchover day 1 - T370962
[15:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:16] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[15:01:52] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: Datacenter Switchover, T370962]
[15:02:12] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: Datacenter Switchover, T370962]
[15:04:12] <swfrench-wmf>	 going to monitor how things progress over the next 15-20m before moving on to the next step
[15:04:28] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync
[15:09:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync
[15:09:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[15:09:48] <wikibugs>	 (CR) Brouberol: [C:+1] "Oops, sorry, that slipped through the cracks" [deployment-charts] - https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[15:13:10] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[15:15:58] <wikibugs>	 (PS1) JHathaway: puppetboard: use stdlib version of to_python [puppet] - https://gerrit.wikimedia.org/r/1075254
[15:16:23] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075254 (owner: JHathaway)
[15:22:11] <swfrench-wmf>	 things are looking good, and I'm going to proceed with depooling discovery services from eqiad
[15:22:23] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[15:22:26] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[15:23:42] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T370962
[15:23:48] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[15:26:11] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075254 (owner: JHathaway)
[15:27:54] <wikibugs>	 (CR) JHathaway: [C:+2] puppetboard: use stdlib version of to_python [puppet] - https://gerrit.wikimedia.org/r/1075254 (owner: JHathaway)
[15:29:27] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm
[15:32:07] <wikibugs>	 (PS1) JHathaway: wmflib::to_python: remove [puppet] - https://gerrit.wikimedia.org/r/1075258
[15:33:10] <wikibugs>	 (CR) JHathaway: [C:+2] wmflib::to_python: remove [puppet] - https://gerrit.wikimedia.org/r/1075258 (owner: JHathaway)
[15:34:51] <ottomata>	 swfrench-wmf: i was about to do a rolling restart of eventgate-analytics to pick up a config change.  should I not?
[15:38:50] <ottomata>	 i think the rolling restart shouldn't affect the traffic routing and switchover stuff,  so i'm going to proceed
[15:38:56] <swfrench-wmf>	 ottomata: if it's a fairly low risk change, I don't see any reason why not. be aware that the service is now depooled from eqiad and serving solely from codfw - i.e., to note while you're verifying things (and also bear in mind that codfw is serving all traffic now).
[15:38:57] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10172078 (Krinkle)
[15:39:00] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10172072 (Papaul) Open→Resolved This is done
[15:39:09] <ottomata>	 swfrench-wmf: great thank you
[15:39:25] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync
[15:39:43] <ottomata>	 !log rolling restart of eventgate-analytics to pick up https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073855
[15:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:58] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync
[15:40:06] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync
[15:40:55] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync
[15:41:00] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync
[15:41:08] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync
[15:43:23] <wikibugs>	 (PS3) Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623
[15:44:50] <wikibugs>	 (PS3) Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/1072562
[15:45:23] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10172089 (Papaul) Open→Resolved a:Papaul This is also done we can resolve.
[15:48:06] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#10172117 (Papaul) Open→Resolved a:Papaul This is done we are tracking the decom  in https://phabricator.wikimedia.org/T375419 and https://...
[15:48:15] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10172128 (Papaul) Open→Resolved a:Papaul This is done
[15:49:12] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover - T370962
[15:49:18] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[15:50:12] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10172139 (Papaul)
[15:50:17] <swfrench-wmf>	 !log switchover day 1 actions are complete - T370962
[15:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:23] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:52:46] <swfrench-wmf>	 hmmm ... thumbor
[15:54:01] <akosiaris>	 It might require more replicas, I think we had that in some past switchover
[15:54:35] <wikibugs>	 (PS1) Mforns: Correct port number for data-gateway in commons impact analytics service [deployment-charts] - https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035)
[15:54:41] <dcaro>	 godog: np, stuff happens :), XioNoX yw!
[15:55:01] <swfrench-wmf>	 akosiaris: ah, that's plausible, I was aware of the issue with the discovery record last time around, but not capacity
[15:57:23] <duesen>	 Hi all! I'd like to deploy a fix for a minor issue (missing vary header) with some REST endpoints in the next hour. Any concerns?
[15:57:44] <duesen>	 swfrench-wmf: I was told you are fiddling with things, would a backport deployment be ok with you?
[15:58:21] <brennen>	 jouncebot nowandnext
[15:58:21] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Southward Datacenter Switchover: Services + Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500)
[15:58:21] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1500)
[15:58:21] <jouncebot>	 In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1600)
[15:58:30] <swfrench-wmf>	 duesen: of you could hold off for now, that would be greatly appreciated - we're still troubleshooting an issue
[15:58:34] <swfrench-wmf>	 *if you'
[16:00:05] <jouncebot>	 jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:07] <rzl>	 (nothing in today's puppet window, the conch is still swfrench-wmf's)
[16:00:28] <brennen>	 swfrench-wmf, duesen: once things are clear, o
[16:00:49] <brennen>	 er, sorry - once things are clear, i'm needing to run a train presync.  fine to have a backport go out first though.
[16:00:55] <wikibugs>	 (CR) Santiago Faci: [C:+2] Correct port number for data-gateway in commons impact analytics service [deployment-charts] - https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[16:01:29] <brennen>	 (again, once things are clear - no pressure on my end.)
[16:02:02] <wikibugs>	 (Merged) jenkins-bot: Correct port number for data-gateway in commons impact analytics service [deployment-charts] - https://gerrit.wikimedia.org/r/1075265 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[16:02:19] <duesen>	 brennen: I'm not ready yet, the patch needs to pass CI first. I hope to be ready in about 40 minutes.
[16:03:39] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10172197 (jhathaway) >>! In T370786#10169691, @Eevans wrote: > We need to make sure Corto behaves well in a channels with existing bots, so the command namespace needs to be unique; Corto should...
[16:04:17] <swfrench-wmf>	 brennen: ack, thanks - I'll let you know
[16:06:00] <logmsgbot>	 !log swfrench@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad
[16:06:59] <swfrench-wmf>	 !log repooled swift-ro in eqiad to potentially mitigate issues with thumbor - T370962
[16:07:04] <wikibugs>	 (PS4) Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997)
[16:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:05] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[16:07:09] <logmsgbot>	 !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[16:07:26] <logmsgbot>	 !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[16:08:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:09:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:09:55] <sukhe>	 hello
[16:09:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:09:59] <sukhe>	 !incidents
[16:09:59] <sirenbot>	 5276 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:09:59] <sirenbot>	 5277 (UNACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[16:10:00] <sirenbot>	 5274 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[16:10:00] <sirenbot>	 5273 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[16:10:00] <sirenbot>	 5272 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[16:10:00] <sirenbot>	 5271 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[16:10:03] <sukhe>	 !ack 5276
[16:10:03] <sirenbot>	 5276 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:10:04] <sukhe>	 !ack 5277
[16:10:04] <sirenbot>	 5277 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[16:10:12] <denisse>	 Here.
[16:10:27] <sukhe>	 swift-ro was repooled in eqiad right?
[16:10:37] <sukhe>	 denisse: ACKed them both 
[16:10:38] <swfrench-wmf>	 sukhe: yes, exactly
[16:10:43] <denisse>	 sukhe: yes, that could explain the page.
[16:12:50] <sfaci>	 !log Deploying Refinery
[16:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:10] <sukhe>	 hmm that doesn't seem to be getting better
[16:13:15] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:26] <hnowlan>	 I suspect we might need to pool the `swift` dnsdisc also 
[16:13:31] <wikibugs>	 (CR) Brouberol: [C:+1] rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: DCausse)
[16:13:51] <sukhe>	 that was done it seems? 
[16:13:52] <sukhe>	 set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad
[16:14:17] <wikibugs>	 (PS1) Daniel Kinzler: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136)
[16:14:23] <swfrench-wmf>	 alright, so the current status is as follows: errors started a bit after 15:40, which is a bit after we depooled swift-ro in eqiad
[16:14:35] <sukhe>	 swfrench-wmf: yeah matches exactly
[16:14:39] <logmsgbot>	 !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda]: Regular analytics weekly train [analytics/refinery@cdcefda6]
[16:14:45] <hnowlan>	 sukhe: there's a dnsdisc service for `swift` with no -ro or -rw also 
[16:14:54] <sukhe>	 hnowlan: ah
[16:15:05] <_joe_>	 I would guess that's the one used
[16:15:25] <swfrench-wmf>	 bah, it'll hit that one now
[16:15:36] <sukhe>	 swfrench-wmf: ok thanks
[16:15:40] <swfrench-wmf>	 *I'll repool that in eqiad now
[16:15:45] <_joe_>	 also remember there's a 5 minutes delay if you're not wiping caches
[16:16:21] <sukhe>	 we can wipe them but I will hold off
[16:16:35] <logmsgbot>	 !log swfrench@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad
[16:16:39] <logmsgbot>	 !log swfrench@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad
[16:17:09] <swfrench-wmf>	 I hit enter too quickly, but yes, should be re-pooling now, modulo the delay _joe_ mentioned
[16:17:29] <hnowlan>	 seeing an increase in thumbor loads in eqiad already 
[16:17:42] <_joe_>	 you can check the ats config and see it mentions swift.discovery.wmnet, indeed
[16:17:45] <sukhe>	 errors going down as well
[16:17:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172257 (Jhancock.wm) a:Papaul→Jhancock.wm so I swapped the CPUs, and it boots. but the idrac won't connect. when I connect a front plate the ip shows as 0...
[16:18:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:18:28] <swfrench-wmf>	 !log repooled swift in eqiad to potentially mitigate issues with thumbor and swift - T370962
[16:18:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:34] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[16:18:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[16:19:14] <wikibugs>	 (PS2) Daniel Kinzler: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136)
[16:19:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398#10172251 (Jhancock.wm) a:Jhancock.wm wrong ticket
[16:19:32] <swfrench-wmf>	 that's looking a lot better
[16:19:41] <sukhe>	 yep
[16:19:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:19:52] <sukhe>	 so probably a note on the swift vs swift-ro,rw distinction
[16:19:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:53] <swfrench-wmf>	 sukhe: yeah, I can investigate the history a bit more today and document and / or add the service to the exclusion list
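Editorial sketch, not part of the channel traffic: the swift vs. swift-ro confusion above can be checked by replaying the `conftool action` lines that logmsgbot echoed and seeing which dnsdisc objects ended up pooled. The regular expression below is inferred only from the three lines quoted in this log; the function name `replay_pooled_state` and the assumption that the last action per (service, site) wins are illustrative and do not describe how conftool itself stores state.

```python
import re

# Pattern inferred from the logmsgbot lines in this log (an assumption, not
# conftool's documented output format).
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(?P<pooled>true|false); "
    r"selector: dnsdisc=(?P<service>[\w-]+),name=(?P<site>\w+)"
)


def replay_pooled_state(lines):
    """Apply each logged action in order; return {(service, site): pooled}."""
    state = {}
    for line in lines:
        match = CONFTOOL_RE.search(line)
        if match:
            key = (match["service"], match["site"])
            state[key] = match["pooled"] == "true"
    return state


# Lines copied from the log above (tab after the nick replaced by a space).
SAMPLE_LINES = [
    "[16:06:00] <logmsgbot> !log swfrench@cumin1002 conftool action : "
    "set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad",
    "[16:16:35] <logmsgbot> !log swfrench@cumin1002 conftool action : "
    "set/pooled=false; selector: dnsdisc=swift,name=eqiad",
    "[16:16:39] <logmsgbot> !log swfrench@cumin1002 conftool action : "
    "set/pooled=true; selector: dnsdisc=swift,name=eqiad",
]

if __name__ == "__main__":
    for (service, site), pooled in replay_pooled_state(SAMPLE_LINES).items():
        print(f"{service:10s} {site:6s} pooled={pooled}")
```

Keying on the exact service name keeps `swift` and `swift-ro` as separate entries, which is the distinction that was missed until 16:14:45.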
[16:22:25] <logmsgbot>	 !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda]: Regular analytics weekly train [analytics/refinery@cdcefda6] (duration: 07m 46s)
[16:22:45] <logmsgbot>	 !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda] (thin): Regular analytics weekly train THIN [analytics/refinery@cdcefda6]
[16:23:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: mw2427.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2427.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:24:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[16:26:42] <wikibugs>	 (PS2) RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - https://gerrit.wikimedia.org/r/1075098
[16:27:20] <logmsgbot>	 !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda] (thin): Regular analytics weekly train THIN [analytics/refinery@cdcefda6] (duration: 04m 34s)
[16:27:37] <logmsgbot>	 !log sfaci@deploy1003 Started deploy [analytics/refinery@cdcefda] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cdcefda6]
[16:28:01] <swfrench-wmf>	 denisse: sukhe: alright, I believe the dust has settled here. thank you both for your assistance and patience with this bit of switchover fallout
[16:28:13] <sukhe>	 <3
[16:28:25] <wikibugs>	 (CR) Ottomata: [C:+2] Remove labswiki from HDFS imported dumps [puppet] - https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: Joal)
[16:28:33] <denisse>	 swfrench-wmf: ACK, thank you! :)
[16:29:29] <wikibugs>	 (CR) RLazarus: deployment_server: mwscript_k8s usability tweaks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075098 (owner: RLazarus)
[16:30:48] <wikibugs>	 (CR) Aaron Schulz: [C:+2] REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: Daniel Kinzler)
[16:30:50] <logmsgbot>	 !log sfaci@deploy1003 Finished deploy [analytics/refinery@cdcefda] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cdcefda6] (duration: 03m 13s)
[16:31:17] <brennen>	 swfrench-wmf: clear for some train operations?
[16:31:42] <swfrench-wmf>	 brennen: was just about to say, heh - yes, at this point, I think you should be good to go
[16:31:47] <brennen>	 thanks!
[16:33:45] <Dreamy_Jazz>	 logstash.wikimedia.org seems broken to me. Our team's dashboard has disappeared and the home page seems to be an outdated version.
[16:34:14] <Dreamy_Jazz>	 When I view logstash.wikimedia.org I see "AHT Team" which was removed a while ago
[16:34:23] <Dreamy_Jazz>	 And no "Trust and Safety Product"
[16:35:06] <Dreamy_Jazz>	 Plus when I open our team's dashboard from a link we have saved in a google doc, there is an error saying the dashboard does not exist. The URL we have saved is https://logstash.wikimedia.org/app/dashboards#/view/bc0caa20-92d5-11ee-b8fa-893e52d5cd7d?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-1w%2Cto%3Anow))
[16:35:24] <cwhite>	 that's no good - it appears to be a product of the failover to codfw
[16:35:38] * cwhite looks into rolling that back
[16:35:47] <cwhite>	 swfrench-wmf: ^^
[16:36:01] <wikibugs>	 (PS5) Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997)
[16:36:09] <swfrench-wmf>	 ah, interesting!
[16:36:25] <swfrench-wmf>	 sounds like it might make sense to switch logstash back to eqiad
[16:36:36] <swfrench-wmf>	 cwhite: SGTY?
[16:36:43] <cwhite>	 yes, please :)
[16:36:45] <wikibugs>	 ops-codfw, SRE, DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172303 (Jhancock.wm) provisioning script failed. it can't get a connection to the server. set the idrac manually, but still doesn't ping. This is still under warr...
[16:37:54] <logmsgbot>	 !log swfrench@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=logstash,name=eqiad
[16:38:06] <logmsgbot>	 !log swfrench@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=logstash,name=codfw
[16:38:58] <swfrench-wmf>	 !log switched logstash.discovery.wmnet back to eqiad due to reports of stale dashboards - T370962
[16:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:08] <stashbot>	 T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[16:39:50] <Dreamy_Jazz>	 Do you know how long the cache might last for switching back to eqiad?
[16:39:56] <wikibugs>	 (CR) Volans: "LGTM, non voting just because I don't have the specific context for the change." [puppet] - https://gerrit.wikimedia.org/r/1075098 (owner: RLazarus)
[16:40:27] <cwhite>	 Dreamy_Jazz: give your browser a hard refresh and let us know if the issue is corrected?
[16:40:40] <swfrench-wmf>	 Dreamy_Jazz: it should happen within 5m, but I can clear the recursor caches right now
[16:40:53] <Dreamy_Jazz>	 It's fixed for me now. Thanks.
[16:41:13] <cwhite>	 Thank you swfrench-wmf!
[16:41:18] <logmsgbot>	 !log brennen@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643
[16:41:25] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[16:41:26] <swfrench-wmf>	 glad to hear that did it
[16:41:36] <swfrench-wmf>	 cwhite: no problem at all, and thanks for flagging!
[16:42:20] <cwhite>	 <3
[16:48:00] <duesen>	 swfrench-wmf, brennen: CI is in a bad mood, we'll do it later in the day
[16:49:38] <wikibugs>	 SRE-OnFire, Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10172328 (jhathaway) I think keeping everything has private is the best path for the MVP.
[16:50:09] <brennen>	 duesen: ack
[16:51:39] <wikibugs>	 (CR) CI reject: [V:-1] REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: Daniel Kinzler)
[16:52:06] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10172323 (ovasileva) p:High→Medium Potential next steps in addition to the tickets above is QTE preparation and test cases.
[16:53:30] <wikibugs>	 (CR) Btullis: "I have a question though. Why can't we just enable dumps of labswiki, now that it has been moved to the core DB servers?" [puppet] - https://gerrit.wikimedia.org/r/1075238 (https://phabricator.wikimedia.org/T217792) (owner: Joal)
[16:54:22] <sfaci>	 !log Deployed refinery using scap, then deployed onto hdfs
[16:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:57:53] <logmsgbot>	 !log sfaci@deploy1003 Started deploy [airflow-dags/analytics@e1fb17b]: deploying new datahub ingestion
[16:58:30] <logmsgbot>	 !log sfaci@deploy1003 Finished deploy [airflow-dags/analytics@e1fb17b]: deploying new datahub ingestion (duration: 00m 52s)
[16:59:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1700)
[17:01:13] <wikibugs>	 (CR) JHathaway: [C:+2] vrts_aliases: add a basic safeguard [puppet] - https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: JHathaway)
[17:02:14] <wikibugs>	 (Merged) jenkins-bot: REST: vary on x-restbase-compat header if present [core] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075269 (https://phabricator.wikimedia.org/T374136) (owner: Daniel Kinzler)
[17:02:26] <wikibugs>	 (PS2) JHathaway: vrts_aliases: add a basic safeguard [puppet] - https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090)
[17:03:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[17:06:03] <wikibugs>	 ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10172392 (Jhancock.wm) @colewhite I might need a moment with these servers. getting errors on prometheus. T375328
[17:07:31] <wikibugs>	 ops-codfw, SRE, DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10172394 (Jhancock.wm) @jijiki papaul helped me get it singable and the system is back online. The parts have been swapped and we might need to test under full load...
[17:09:03] <logmsgbot>	 !log brennen@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.24  refs T373643 (duration: 27m 45s)
[17:09:10] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[17:10:35] <brennen>	 !log train presync finished successfully; going AFK for ~45 minutes but will return to roll group0 during train window (T373643, T375477)
[17:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:43] <stashbot>	 T375477: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477
[17:11:28] <wikibugs>	 (CR) Hokwelum: [C:+1] Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287 (owner: Krinkle)
[17:17:34] <wikibugs>	 (CR) Btullis: Add an hdfs_file type and provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[17:19:16] <wikibugs>	 (CR) Bartosz Dziewoński: [C:-1] "I tested this on the beta cluster and it doesn't work. I don't know why, I will try debugging it some other time." [mediawiki-config] - https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: Bartosz Dziewoński)
[17:21:50] <wikibugs>	 (PS1) Bking: admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948)
[17:34:54] <wikibugs>	 (PS19) Ssingh: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: CDobbins)
[17:37:52] <wikibugs>	 (PS1) Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585)
[17:39:42] <wikibugs>	 (CR) Ottomata: Add an hdfs_file type and provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[17:40:36] <wikibugs>	 (CR) Ottomata: Add an hdfs_file type and provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[17:47:47] <wikibugs>	 (PS1) Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280
[17:52:12] <wikibugs>	 (CR) Jforrester: [C:-1] "Let's not regress portals." [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280 (owner: Reedy)
[17:52:25] <wikibugs>	 (PS2) Jforrester: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280 (owner: Reedy)
[17:52:31] <Reedy>	 lol
[18:00:05] <jouncebot>	 brennen and jnuche: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T1800).
[18:01:20] <duesen>	 swfrench-wmf, brennen: Can I go ahead with the backport now? The train already rolled to group0 earlier today, right? I could also wait for the backport window, but it's getting late here...
[18:02:46] <swfrench-wmf>	 duesen: no objections on my end, but I'd probably defer to brennen here
[18:03:08] <Amir1>	 duesen: the train is not rolled out to group0 and brennen is actually planning to do it during this window 
[18:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:03:37] <duesen>	 Amir1: ah ok, thanks. Let's hop on our sync call, then :)
[18:03:54] <wikibugs>	 (PS3) Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280
[18:05:00] <wikibugs>	 (CR) CI reject: [V:-1] MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280 (owner: Reedy)
[18:05:26] <wikibugs>	 (PS4) Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280
[18:05:37] <duesen>	 brennen: Amir just reminded me that the patch will actually ride the train since it's already merged into wmf.24. Here it is: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1075269
[18:06:03] <wikibugs>	 (PS1) Cwhite: es-exporter: add wikifunctions queries [puppet] - https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426)
[18:06:18] <wikibugs>	 (PS8) BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[18:06:34] <wikibugs>	 (CR) CI reject: [V:-1] es-exporter: add wikifunctions queries [puppet] - https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) (owner: Cwhite)
[18:07:11] <wikibugs>	 (CR) BCornwall: varnish: Occasional RSA cert connection warnings (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[18:08:56] <wikibugs>	 (PS2) Cwhite: es-exporter: add wikifunctions queries [puppet] - https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426)
[18:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:10:01] <brennen>	 o/
[18:10:48] <wikibugs>	 (PS3) Cwhite: es-exporter: add wikifunctions queries [puppet] - https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426)
[18:10:53] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643)
[18:10:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[18:11:03] <brennen>	 going ahead to group0.
[18:11:38] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.43.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075289 (https://phabricator.wikimedia.org/T373643) (owner: TrainBranchBot)
[18:14:51] <wikibugs>	 (CR) Jdlrobson: [C:+1] Deploy donate link to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[18:21:34] <logmsgbot>	 !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.24  refs T373643
[18:21:41] <stashbot>	 T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643
[18:35:08] <wikibugs>	 (PS3) RLazarus: deployment_server: mwscript_k8s usability tweaks [puppet] - https://gerrit.wikimedia.org/r/1075098
[18:42:35] <wikibugs>	 (PS9) BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[18:44:26] <wikibugs>	 (PS1) BCornwall: varnish: Cast test resources to str [puppet] - https://gerrit.wikimedia.org/r/1075300
[18:48:33] <wikibugs>	 (CR) Scott French: [C:+1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1075098 (owner: RLazarus)
[18:49:20] <wikibugs>	 (CR) Nik Gkountas: [C:+1] ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387) (owner: Kevin Bazira)
[18:50:29] <wikibugs>	 (CR) RLazarus: [C:+2] "Thanks both!" [puppet] - https://gerrit.wikimedia.org/r/1075098 (owner: RLazarus)
[18:53:52] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet
[19:00:54] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet
[19:04:43] <wikibugs>	 (PS10) BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[19:05:19] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1075300 (owner: BCornwall)
[19:14:46] <wikibugs>	 (PS11) BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[19:17:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10172825 (andrea.denisse) Open→Resolved Hi, I've resynced the drive and it's now part of our RAID array:  ` denisse@prometheus1008:~$ sudo mdadm --de...
[19:19:05] <wikibugs>	 (CR) Joal: "One question inline" [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[19:31:17] <wikibugs>	 (CR) JHathaway: [C:+1] puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[19:40:13] <wikibugs>	 (CR) Cwhite: [C:+2] es-exporter: add wikifunctions queries [puppet] - https://gerrit.wikimedia.org/r/1075284 (https://phabricator.wikimedia.org/T371426) (owner: Cwhite)
[19:42:13] <wikibugs>	 (PS2) Samtar: Set `wgMFCustomSiteModules` to false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: Steven Rawson)
[19:48:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: Steven Rawson)
[19:48:38] <Izno>	 TheresNoTime, ^
[19:49:26] <TheresNoTime>	 ta! Should probably stop laying on the bed and move towards a laptop
[19:49:47] <Izno>	 priorities
[19:50:04] <TheresNoTime>	 Do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug installed?
[19:50:11] <Izno>	 just did yes
[19:53:23] <TheresNoTime>	 may as well start now
[19:53:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: Steven Rawson)
[19:54:22] <wikibugs>	 (Merged) jenkins-bot: Set `wgMFCustomSiteModules` to false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075306 (https://phabricator.wikimedia.org/T375540) (owner: Steven Rawson)
[19:55:23] <wikibugs>	 (PS1) C. Scott Ananian: Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539)
[19:55:39] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Scribunto] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: C. Scott Ananian)
[19:56:38] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]]
[19:56:45] <stashbot>	 T375540: Set wgMFCustomSiteModules to false for mediawikiwiki - https://phabricator.wikimedia.org/T375540
[19:58:37] <logmsgbot>	 !log samtar@deploy1003 samtar, izno: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:58:59] <TheresNoTime>	 Izno: your patch is now ready for testing — you can use WikimediaDebug to pick any of the test servers and check if the change works as expected on mediawikiwiki
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T2000)
[20:00:05] <jouncebot>	 Krinkle, toyofuku, derenrich, Izno, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <derenrich>	 o/
[20:00:24] <toyofuku>	 hellooo
[20:00:24] <TheresNoTime>	 Izno: given the type of patch, I'm not sure if its actually testable..
[20:00:37] <Izno>	 I think I can test it, I just don't know what exactly I'm doing :)
[20:02:12] <wikibugs>	 (PS1) CDanis: coredns: add support for Service externalIPs [deployment-charts] - https://gerrit.wikimedia.org/r/1075311 (https://phabricator.wikimedia.org/T344171)
[20:02:36] <TheresNoTime>	 Izno: so if you go to, e.g. https://www.mediawiki.org/wiki/MediaWiki:Common.css, and in your browser you should see the WikimediaDebug extension icon (a wikitech globe iirc) — click that, select k8s-mwdebug, and then click the toggle
[20:03:04] <Izno>	 toggle clicked
[20:03:30] <TheresNoTime>	 if you refresh that tab, you're now using a test server where your config patch has been applied
[20:03:47] <Izno>	 ok, cool
[20:04:27] <TheresNoTime>	 (make sure you toggle that off when you're done) — as for testing your config patch, I'm not sure how you'd really do that
[20:04:47] <Izno>	 test here is to add something to Common.css, see if it pops through in console or not
[20:04:51] <Izno>	 while on the mobile domain
[20:05:22] <TheresNoTime>	 ack — let me know when you're confident your patch works, and then I'll continue the sync :)
[20:07:11] <derenrich>	 (just fyi this is my second attempt at my first patch, so not exactly sure what's supposed to be happening here)
[20:08:18] <TheresNoTime>	 derenrich: just handling Izno's patch as I began that one early, then I'll be working through the list at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240924T2000
[20:08:30] <derenrich>	 cool
[20:09:29] <duesen>	 brennen: With the train landing on group0, I would have expected this backport to be live now: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1075269. But that doesn't seem to be the case. At least, I can't observe the changed Vary header in the response. Can you confirm that this patch is in the history of the version that's deployed? Am I missing something?
[20:10:00] <cscott>	 i'm here
[20:10:11] <Izno>	 TheresNoTime, it's functional
[20:10:38] <TheresNoTime>	 duesen: that wasn't deployed I believe — I got a warning that it was on the deployment server but not sync'd
[20:10:42] <cscott>	 duesen: on gerrit there's an "included in" drop down
[20:10:58] <TheresNoTime>	 duesen: it'll be sync'd once I finished Izno's patch afaik
[20:10:59] <duesen>	 cscott: oh nice, thanks!
[20:11:00] <TheresNoTime>	 Izno: ack
[20:11:03] <logmsgbot>	 !log samtar@deploy1003 samtar, izno: Continuing with sync
[20:11:25] <duesen>	 TheresNoTime: ah, thanks! I was afraid that this would happen. Sorry for the mess.
[20:11:32] <cscott>	 Under the 'three dots' menu at the top right https://usercontent.irccloud-cdn.com/file/GiAt4BS1/image.png
[20:12:23] <duesen>	 cscott: that's neat, but not useful in this case - since it actually *is* on the branch, but didn't get deployed :)
[20:12:25] <cscott>	 ^ but as TheresNoTime says, just because it is on the branch doesn't *necessarily* mean that it was deployed :) 
[20:12:32] <cscott>	 yeah.
[20:12:46] <cscott>	 anyway, just sharing some useful gerrit trivia really
[20:15:07] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:15:50] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075306|Set `wgMFCustomSiteModules` to false for mediawikiwiki (T375540)]] (duration: 19m 11s)
[20:15:56] <stashbot>	 T375540: Set wgMFCustomSiteModules to false for mediawikiwiki - https://phabricator.wikimedia.org/T375540
[20:15:59] <TheresNoTime>	 Izno: live on production now :)
[20:16:05] <Izno>	 :D
[20:16:08] <TheresNoTime>	 duesen: yours should be too, can you check it's working as expected?
[20:16:53] <wikibugs>	 (PS4) Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287
[20:17:02] <TheresNoTime>	 Krinkle: around for your deployment?
[20:18:08] <wikibugs>	 (PS2) Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585)
[20:18:32] <TheresNoTime>	 toyofuku: will move to yours next, ready? :)
[20:18:37] <toyofuku>	 yep!
[20:19:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:20:07] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:20:55] <wikibugs>	 (Merged) jenkins-bot: Deploy donate link to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:21:17] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]]
[20:21:23] <stashbot>	 T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585
[20:23:19] <logmsgbot>	 !log samtar@deploy1003 samtar, toyofuku: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:33] <TheresNoTime>	 toyofuku: ready for testing — is this something you can test?
[20:23:41] <toyofuku>	 Yep!  Testing now
[20:23:59] <toyofuku>	 Gonna be a couple minutes as I want to be thorough, but I'll go quick 🤞
[20:24:35] <TheresNoTime>	 no problem! just ping me when you're ready :)
[20:27:01] <toyofuku>	 TheresNoTime: we're looking good, thank you!
[20:27:02] <duesen>	 TheresNoTime: yes, it works!
[20:27:16] <logmsgbot>	 !log samtar@deploy1003 samtar, toyofuku: Continuing with sync
[20:27:31] <TheresNoTime>	 ack :)
[20:32:29] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074550|Deploy donate link to all wikis (T373585)]] (duration: 11m 11s)
[20:32:35] <stashbot>	 T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585
[20:32:53] <TheresNoTime>	 toyofuku: that's live :)
[20:33:06] <wikibugs>	 (PS2) DErenrich: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1074311
[20:33:28] <TheresNoTime>	 derenrich: ready for your patch? Did you say this was a retry?
[20:33:46] <derenrich>	 ready. and it never merged due to some issue unrelated to my patch
[20:33:47] <wikibugs>	 (PS2) Bking: admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948)
[20:33:47] <toyofuku>	 ty!!!
[20:34:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1074311 (owner: DErenrich)
[20:34:56] <wikibugs>	 (Merged) jenkins-bot: Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1074311 (owner: DErenrich)
[20:35:16] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]]
[20:35:21] <derenrich>	 does it matter which server i point wikimedia debug to?
[20:35:38] <TheresNoTime>	 derenrich: no, any will work
[20:37:15] <logmsgbot>	 !log samtar@deploy1003 derenrich, samtar: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:37:27] <TheresNoTime>	 derenrich: ready for testing now
[20:37:36] <derenrich>	 it's working! 
[20:37:45] <TheresNoTime>	 cscott: I'm going to set your patch merging now
[20:37:48] <logmsgbot>	 !log samtar@deploy1003 derenrich, samtar: Continuing with sync
[20:37:52] <cscott>	 ok, thanks!
[20:37:55] * Krinkle is around albeit late
[20:37:58] <wikibugs>	 (CR) Samtar: [C:+2] Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: C. Scott Ananian)
[20:38:40] <Krinkle>	 TheresNoTime: happy to tag join still if possible, otherwise I might roll it out later today 
[20:39:06] <TheresNoTime>	 Krinkle: no worries, will probably be able to get yours out while ^ merges
[20:39:26] <wikibugs>	 (PS5) Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287
[20:39:49] <derenrich>	 thanks for helping! i assume i'm good to leave?
[20:40:36] <TheresNoTime>	 derenrich: Ideally you'd test it again in prod once it's finished syncing (a couple more minutes)
[20:40:43] <derenrich>	 ok i can wait 
[20:41:54] <cscott>	 https://test.wikipedia.org/wiki/T375539 is my test page once the backport deploys
[20:42:22] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074311|Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension]] (duration: 07m 05s)
[20:42:29] <TheresNoTime>	 derenrich: live on prod now
[20:42:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287 (owner: Krinkle)
[20:42:47] <derenrich>	 yup working
[20:42:50] <derenrich>	 thanks so much
[20:42:54] <TheresNoTime>	 No problem! :)
[20:43:04] <Krinkle>	 thx
[20:43:29] <wikibugs>	 (Merged) jenkins-bot: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287 (owner: Krinkle)
[20:43:39] <TheresNoTime>	 cscott: I left starting your merge a little late, sorry — still going to be another 20m or so, is that okay?
[20:43:51] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]]
[20:44:08] <cscott>	 yeah i can deal
[20:44:20] <cscott>	 zuul is very slow today
[20:45:02] <TheresNoTime>	 there's a couple of jobs running for 1-2hrs :/
[20:45:12] <wikibugs>	 (PS1) Krinkle: labs: Remove unused wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1075313
[20:45:43] <logmsgbot>	 !log samtar@deploy1003 samtar, krinkle: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:45:48] <cscott>	 there was a job for Experiment:GrowthExperiments which failed one of its parallel tests and then ran for an additional *hour* running the other branches of the parallel test suite despite the entire build being doomed
[20:45:48] <TheresNoTime>	 Krinkle: are you going to want to test your patch, or is "nothing breaking" enough? :D
[20:46:16] <Krinkle>	 I'll run it through mwdebug to check for any surprise PHP warnings but other than that no
[20:46:39] <TheresNoTime>	 ack, lemme know when I can sync :)
[20:47:17] <Krinkle>	 LGTM, go ahead
[20:47:25] <logmsgbot>	 !log samtar@deploy1003 samtar, krinkle: Continuing with sync
[20:47:50] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Jasonb28' 'MichiganNJPat' # T375516
[20:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:56] <stashbot>	 T375516: Unblock stuck global rename of Jasonb28 - https://phabricator.wikimedia.org/T375516
[20:49:14] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10173104 (Jclark-ctr) Open→Resolved
[20:52:04] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071287|Remove unused wgStatsMethod, wgResourceLoaderClientPreferences]] (duration: 08m 13s)
[20:52:33] <TheresNoTime>	 Krinkle: want me to do the beta-only 1075313 while I'm waiting?
[20:52:49] <Krinkle>	 sure, feel free to merge.
[20:53:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075313 (owner: Krinkle)
[20:54:15] <wikibugs>	 (Merged) jenkins-bot: labs: Remove unused wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1075313 (owner: Krinkle)
[20:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:58:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10173118 (Jclark-ctr) @jijiki please update preseed.yaml and site.pp when you can; we have received these and cannot move forward until that step is completed.
[21:00:22] <cscott>	 TheresNoTime: subbu is going to step in to verify the backport if I'm AFK when jenkins is done
[21:00:48] <TheresNoTime>	 cscott: no worries, there's about 8m left for the merge fwiw
[21:01:27] <subbu>	 (i am here)
[21:03:52] <wikibugs>	 (Merged) jenkins-bot: Use `class` instead of `id` for scribunto errors [extensions/Scribunto] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075309 (https://phabricator.wikimedia.org/T375539) (owner: C. Scott Ananian)
[21:04:22] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]]
[21:04:29] <stashbot>	 T375539: Scribunto generates duplicate IDs when there are errors on fragments included more than once on a page - https://phabricator.wikimedia.org/T375539
[21:06:22] <logmsgbot>	 !log samtar@deploy1003 samtar, cscott: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:06:34] <TheresNoTime>	 cscott / subbu ^
[21:07:37] <subbu>	 ack
[21:08:20] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10173141 (Scott_French) @MatthewVernon - FYI, while reviewing the logs from the first part of the switchover earlier today, I noticed that `apus` is depooled everywhere,...
[21:09:09] <subbu>	 TheresNoTime, lgtm. (/cc cscott)
[21:09:21] <logmsgbot>	 !log samtar@deploy1003 samtar, cscott: Continuing with sync
[21:11:38] <wikibugs>	 (PS1) Scott French: sre.discovery.datacenter: exclude kibana7 [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544)
[21:11:38] <wikibugs>	 (CR) Scott French: "Given the discussion in T375544, I think this seems like the most sensible way forward for now. Thanks in advance for the review!" [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[21:13:08] <cscott>	 Thanks subbu!
[21:13:58] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075309|Use `class` instead of `id` for scribunto errors (T375539)]] (duration: 09m 36s)
[21:14:05] <stashbot>	 T375539: Scribunto generates duplicate IDs when there are errors on fragments included more than once on a page - https://phabricator.wikimedia.org/T375539
[21:15:02] <TheresNoTime>	 !log UTC late backport window done
[21:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:11] <wikibugs>	 (PS4) Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486)
[21:15:46] <wikibugs>	 (CR) Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[21:21:22] <wikibugs>	 (PS1) Zabe: Initial configuration for madwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968)
[21:23:06] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for madwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968) (owner: Zabe)
[21:24:04] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for madwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1075316 (https://phabricator.wikimedia.org/T374968) (owner: Zabe)
[21:25:15] <zabe>	 !log create Wiktionary Madurese # T374968
[21:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:21] <stashbot>	 T374968: Create Wiktionary Madurese - https://phabricator.wikimedia.org/T374968
[21:25:44] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Creating madwiktionary (T374968)
[21:27:28] <wikibugs>	 (PS1) Zabe: Initial configuration for kgewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813)
[21:29:28] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for kgewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813) (owner: Zabe)
[21:30:31] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for kgewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075318 (https://phabricator.wikimedia.org/T374813) (owner: Zabe)
[21:30:44] <wikibugs>	 (CR) Cwhite: [C:+1] "Thank you!" [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[21:32:34] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Creating madwiktionary (T374968) (duration: 06m 50s)
[21:32:40] <stashbot>	 T374968: Create Wiktionary Madurese - https://phabricator.wikimedia.org/T374968
[21:32:47] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=madwiktionary --cluster=all 2>&1 | tee /tmp/madwiktionary.UpdateSearchIndexConfig.log # T374968
[21:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:18] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Creating kgewiki (T374813)
[21:34:24] <stashbot>	 T374813: Create Wikipedia Komering - https://phabricator.wikimedia.org/T374813
[21:35:47] <wikibugs>	 (CR) BCornwall: [C:+2] varnish: Cast test resources to str [puppet] - https://gerrit.wikimedia.org/r/1075300 (owner: BCornwall)
[21:40:54] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Creating kgewiki (T374813) (duration: 06m 35s)
[21:41:01] <stashbot>	 T374813: Create Wikipedia Komering - https://phabricator.wikimedia.org/T374813
[21:41:28] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=kgewiki --cluster=all 2>&1 | tee /tmp/kgewiki.UpdateSearchIndexConfig.log # T374813
[21:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:15] <wikibugs>	 (PS1) Zabe: Initial configuration for moswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641)
[21:48:40] <wikibugs>	 (PS1) Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948)
[21:48:58] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for moswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641) (owner: Zabe)
[21:49:09] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[21:49:41] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for moswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1075320 (https://phabricator.wikimedia.org/T374641) (owner: Zabe)
[21:50:28] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Creating moswiki (T374641)
[21:50:34] <stashbot>	 T374641: Create Wikipedia Mooré - https://phabricator.wikimedia.org/T374641
[21:56:34] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10173348 (Papaul)
[21:57:17] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Creating moswiki (T374641) (duration: 06m 49s)
[21:57:24] <stashbot>	 T374641: Create Wikipedia Mooré - https://phabricator.wikimedia.org/T374641
[21:57:51] <wikibugs>	 (PS1) Zabe: Initial configuration for gorwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088)
[21:58:16] <zabe>	 zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=moswiki --cluster=all 2>&1 | tee /tmp/moswiki.UpdateSearchIndexConfig.log # T374641
[21:58:21] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=moswiki --cluster=all 2>&1 | tee /tmp/moswiki.UpdateSearchIndexConfig.log # T374641
[21:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:36] <wikibugs>	 (PS2) Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948)
[21:58:45] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for gorwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088) (owner: Zabe)
[21:58:47] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[21:59:30] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for gorwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1075322 (https://phabricator.wikimedia.org/T375088) (owner: Zabe)
[22:00:06] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Creating gorwikiquote (T375088)
[22:00:23] <stashbot>	 T375088: Create Wikiquote Gorontalo - https://phabricator.wikimedia.org/T375088
[22:03:11] <Reedy>	 bye wikibugs
[22:06:58] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for shnwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1075323 (https://phabricator.wikimedia.org/T375430) (owner: Zabe)
[22:07:02] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Creating gorwikiquote (T375088) (duration: 06m 56s)
[22:07:08] <stashbot>	 T375088: Create Wikiquote Gorontalo - https://phabricator.wikimedia.org/T375088
[22:07:23] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=gorwikiquote --cluster=all 2>&1 | tee /tmp/gorwikiquote.UpdateSearchIndexConfig.log # T375088
[22:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:43] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for shnwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1075323 (https://phabricator.wikimedia.org/T375430) (owner: Zabe)
[22:08:43] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Creating shnwikinews (T375430)
[22:08:49] <stashbot>	 T375430: Create Wikinews Shan - https://phabricator.wikimedia.org/T375430
[22:09:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:42] <wikibugs>	 (PS1) BCornwall: Remove rsa-2048 certs from services [puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569)
[22:15:05] <wikibugs>	 (CR) BCornwall: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[22:15:30] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Creating shnwikinews (T375430) (duration: 06m 47s)
[22:15:34] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=shnwikinews --cluster=all 2>&1 | tee /tmp/shnwikinews.UpdateSearchIndexConfig.log # T375430
[22:15:37] <stashbot>	 T375430: Create Wikinews Shan - https://phabricator.wikimedia.org/T375430
[22:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:43] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327
[22:16:44] <wikibugs>	 (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327 (owner: Zabe)
[22:17:24] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1075327 (owner: Zabe)
[22:20:31] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: update interwiki cache
[22:27:23] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: update interwiki cache (duration: 06m 52s)
[22:33:39] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=elwiki # T363538
[22:33:42] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=cawiki # T363538
[22:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:45] <stashbot>	 T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[22:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:16] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=eowikinews # T363538
[22:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:31] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=foundationwiki # T363538
[22:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:50] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=gurwiki # T363538
[22:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:06] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=metawiki # T363538
[22:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:44] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=orwiki # T363538
[22:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:53] <stashbot>	 T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[22:42:03] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=pawiki # T363538
[22:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:16] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=testwiki # T363538
[22:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:35] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=tumwiki # T363538
[22:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:50] <zabe>	 !log zabe@mwmaint1002:~/T363538$ mwscript cleanupTitles.php --wiki=viwiki # T363538
[22:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:38] <wikibugs>	 (PS1) DErenrich: Bump coverage of the add-a-fact quicksurvey to 0.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1075333
[23:36:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 808.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334
[23:38:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334 (owner: TrainBranchBot)
[23:39:01] <wikibugs>	 (PS1) BPirkle: REST: Adjust REST Sandbox spec for new specs module [mediawiki-config] - https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512)
[23:41:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 815.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded