[00:08:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:31] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:54:49] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1347 MB (1% inode=98%): /tmp 1347 MB (1% inode=98%): /var/tmp 1347 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:08:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T384592)', diff saved to https://phabricator.wikimedia.org/P73047 and previous config saved to /var/cache/conftool/dbconfig/20250203-020900-marostegui.json
[02:09:04] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:12:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:24:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P73048 and previous config saved to /var/cache/conftool/dbconfig/20250203-022407-marostegui.json
[02:34:31] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P73049 and previous config saved to /var/cache/conftool/dbconfig/20250203-023914-marostegui.json
[02:54:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T384592)', diff saved to https://phabricator.wikimedia.org/P73050 and previous config saved to /var/cache/conftool/dbconfig/20250203-025421-marostegui.json
[02:54:24] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:54:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[02:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73051 and previous config saved to /var/cache/conftool/dbconfig/20250203-025443-marostegui.json
[03:07:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[06:12:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:41] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[07:49:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:12:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:30:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:31:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[08:32:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[08:34:35] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[08:35:36] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[08:36:37] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[08:37:03] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[08:37:15] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[08:37:35] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[09:06:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1182 T385084', diff saved to https://phabricator.wikimedia.org/P73052 and previous config saved to /var/cache/conftool/dbconfig/20250203-090558-marostegui.json
[09:06:02] T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084
[09:06:19] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1182.eqiad.wmnet
[09:07:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild + upgrade
[09:12:56] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1182.eqiad.wmnet
[09:13:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Index rebuild
[09:14:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2031 to es2 codfw master dbtmaint T376905', diff saved to https://phabricator.wikimedia.org/P73053 and previous config saved to /var/cache/conftool/dbconfig/20250203-091450-root.json
[09:17:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73054 and previous config saved to /var/cache/conftool/dbconfig/20250203-091700-marostegui.json
[09:17:03] !log marostegui@dns1006 START - running authdns-update
[09:17:42] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:56] !log marostegui@dns1006 END - running authdns-update
[09:22:34] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Index rebuild
[09:22:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es2026.codfw.wmnet with reason: Kernel reboot
[09:26:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2026 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73055 and previous config saved to /var/cache/conftool/dbconfig/20250203-092559-marostegui.json
[09:26:27] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2026.codfw.wmnet
[09:29:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es2037.codfw.wmnet with reason: Kernel reboot
[09:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2037 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73057 and previous config saved to /var/cache/conftool/dbconfig/20250203-092928-marostegui.json
[09:29:40] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2037.codfw.wmnet
[09:32:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P73058 and previous config saved to /var/cache/conftool/dbconfig/20250203-093207-marostegui.json
[09:35:15] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2037.codfw.wmnet
[09:36:09] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2026.codfw.wmnet
[09:36:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73059 and previous config saved to /var/cache/conftool/dbconfig/20250203-093613-root.json
[09:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73060 and previous config saved to /var/cache/conftool/dbconfig/20250203-093628-root.json
[09:40:31] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 10409MiB (2% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[09:47:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P73061 and previous config saved to /var/cache/conftool/dbconfig/20250203-094714-marostegui.json
[09:51:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73062 and previous config saved to /var/cache/conftool/dbconfig/20250203-095118-root.json
[09:51:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73063 and previous config saved to /var/cache/conftool/dbconfig/20250203-095133-root.json
[10:00:31] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:02:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73064 and previous config saved to /var/cache/conftool/dbconfig/20250203-100221-marostegui.json
[10:02:24] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:02:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[10:02:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:03:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73065 and previous config saved to /var/cache/conftool/dbconfig/20250203-100300-marostegui.json
[10:06:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73066 and previous config saved to /var/cache/conftool/dbconfig/20250203-100623-root.json
[10:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73067 and previous config saved to /var/cache/conftool/dbconfig/20250203-100638-root.json
[10:21:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73068 and previous config saved to /var/cache/conftool/dbconfig/20250203-102129-root.json
[10:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73069 and previous config saved to /var/cache/conftool/dbconfig/20250203-102144-root.json
[10:27:08] FIRING: ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase2035-b:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:32] FIRING: [2x] ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:53] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:31:53] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:36:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73070 and previous config saved to /var/cache/conftool/dbconfig/20250203-103634-root.json
[10:36:50] !log
marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73071 and previous config saved to /var/cache/conftool/dbconfig/20250203-103649-root.json
[10:39:32] RESOLVED: [2x] ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:50:31] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 3093MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2034 to es3 codfw master dbtmaint T376905', diff saved to https://phabricator.wikimedia.org/P73072 and previous config saved to /var/cache/conftool/dbconfig/20250203-105915-root.json
[10:59:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2027 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73073 and previous config saved to /var/cache/conftool/dbconfig/20250203-105935-marostegui.json
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1100)
[11:00:35] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2027.codfw.wmnet
[11:02:17] !log marostegui@dns1006 START - running authdns-update
[11:04:08] !log marostegui@dns1006 END - running authdns-update
[11:10:33] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2027.codfw.wmnet
[11:10:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73074 and previous config saved to /var/cache/conftool/dbconfig/20250203-111052-root.json
[11:23:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1155.eqiad.wmnet with reason: Kernel reboot
[11:24:05] !log Reboot and upgrade db1155
[11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73075 and previous config saved to /var/cache/conftool/dbconfig/20250203-112558-root.json
[11:27:11] PROBLEM - MariaDB Replica IO: s7 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:21] PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:21] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:25] PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:34] ^ downtiming
[11:27:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1018.eqiad.wmnet with reason: Kernel reboot
[11:28:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1014.eqiad.wmnet with reason: Kernel reboot
[11:35:11] RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:21] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:21] RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:25] RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:41:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73076 and previous config saved to /var/cache/conftool/dbconfig/20250203-114103-root.json
[11:56:01] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[11:56:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73077 and previous config saved to /var/cache/conftool/dbconfig/20250203-115608-root.json
[11:57:33] (CR) Dreamy Jazz: [C:+1] jobqueue: bump ThumbnailRender concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1115899 (https://phabricator.wikimedia.org/T385273) (owner: Hnowlan)
[11:59:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:10:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1115791 (https://phabricator.wikimedia.org/T378527) (owner: Urbanecm)
[12:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73078 and previous config saved to /var/cache/conftool/dbconfig/20250203-121113-root.json
[12:11:46] (CR) Clément Goubert: [C:+2] mediawiki: Various fixes for mwcron [deployment-charts] - https://gerrit.wikimedia.org/r/1115453 (owner: Clément Goubert)
[12:12:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime)
[12:13:51] (Merged) jenkins-bot: mediawiki: Various fixes for mwcron [deployment-charts] - https://gerrit.wikimedia.org/r/1115453 (owner: Clément Goubert)
[12:17:25] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1116778
[12:19:00] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:19:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:19:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:22:29] (PS2) Clément Goubert: mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781
[12:24:01] (CR) Clément Goubert: [C:+2] mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781 (owner: Clément Goubert)
[12:25:02] (Merged) jenkins-bot: mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781 (owner: Clément Goubert)
[12:25:12] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:25:15] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:28:19] SRE, Infrastructure-Foundations, netops, Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10516521 (ayounsi) Looks all good to me ! First start with non-paging, and revisit later on. I'm wondering if we could re-wri...
[12:39:28] (PS1) Clément Goubert: mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T341555)
[12:39:57] (PS2) Clément Goubert: mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T377962)
[12:41:36] (CR) Clément Goubert: [C:+2] mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert)
[12:42:34] SRE, Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10516577 (Ahonc)
[12:46:40] jouncebot: nowandnext
[12:46:41] No deployments scheduled for the next 1 hour(s) and 13 minute(s)
[12:46:41] In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1400)
[12:47:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73079 and previous config saved to /var/cache/conftool/dbconfig/20250203-124721-root.json
[12:47:26] (PS1) Reedy: Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255)
[12:47:31] (CR) Reedy: [C:+2] Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255) (owner: Reedy)
[12:47:59] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:48:37] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:50:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.010 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:10] !log cgoubert@deploy2002 Started scap sync-world: Testing scap deployment of mw-cron
[12:51:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:39] Reedy: a minute, doing a thing
[12:52:02] claime: CI will take a while anyway. No rush on my part to actually pull/deploy )
[12:52:04] :)
[12:52:09] cool :)
[12:52:36] just lining up some stuff to reduce the php8.1 logspam further
[12:53:15] !log cgoubert@deploy2002 Finished scap sync-world: Testing scap deployment of mw-cron (duration: 02m 46s)
[12:53:30] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:53:32] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:55:26] (PS1) Reedy: SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170)
[12:55:31] (CR) Reedy: [C:+2] SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170) (owner: Reedy)
[12:55:46] !log cgoubert@deploy2002 Started scap sync-world: Rebuild image and release file for mw-cron
[12:55:56] (PS1) Reedy: SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269)
[12:56:01] (CR) Reedy: [C:+2] SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269) (owner: Reedy)
[12:56:34] (Merged) jenkins-bot: Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255) (owner: Reedy)
[12:59:40] (PS1) Reedy: ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267)
[12:59:46] (CR) Reedy: [C:+2] ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267) (owner: Reedy)
[13:01:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3
[13:01:24] (PS1) Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1116789 (https://phabricator.wikimedia.org/T385457)
[13:01:29] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1116790 (https://phabricator.wikimedia.org/T385457)
[13:02:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73080 and previous config saved to /var/cache/conftool/dbconfig/20250203-130226-root.json
[13:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T385457', diff saved to https://phabricator.wikimedia.org/P73081 and previous config saved to /var/cache/conftool/dbconfig/20250203-130248-root.json
[13:02:51] T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457
[13:03:57] (CR) Marostegui: [C:+2] mariadb: Promote db2205 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1116789 (https://phabricator.wikimedia.org/T385457) (owner: Gerrit maintenance bot)
[13:06:31] !log Emergency s3 switchover T385457
[13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:04] !log cgoubert@deploy2002 Stopping before sync operations
[13:07:10] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[13:07:13] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[13:07:18] what's the puppet failure about?
[13:07:28] (Merged) jenkins-bot: SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170) (owner: Reedy)
[13:08:27] Reedy: all yours
[13:08:42] Please do not deploy things now
[13:08:42] (Merged) jenkins-bot: SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269) (owner: Reedy)
[13:08:49] marostegui: ack
[13:08:59] want me to put a scap lock?
[13:09:07] yes please [13:09:12] ack [13:09:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:09:36] !log cgoubert@deploy2002 Locking from deployment [MediaWiki]: Emergency s3 switchover T385457 [13:09:38] T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457 [13:10:43] So a change caused wmcs and insetup puppet failures around 11:52 [13:11:32] (03Merged) 10jenkins-bot: ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267) (owner: 10Reedy) [13:12:24] but I don't see any relevant change [13:14:36] !log jebe@deploy2002 Started deploy [airflow-dags/analytics_product@ce1f0f6]: (no justification provided) [13:14:41] Error: /Stage[main]/Prometheus::Node_kernel_messages/File[/etc/prometheus-node-kernel-messages-ignore-regex.txt]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/prometheus/prometheus-node-kernel-messages-ignore-regex.txt [13:14:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 codfw as read-only for maintenance - T385457', diff saved to https://phabricator.wikimedia.org/P73082 and previous config saved to /var/cache/conftool/dbconfig/20250203-131452-root.json [13:14:55] T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457 [13:15:10] !log jebe@deploy2002 Finished deploy [airflow-dags/analytics_product@ce1f0f6]: (no justification provided) (duration: 00m 36s) [13:15:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary and set section read-write T385457', diff saved to https://phabricator.wikimedia.org/P73083 and previous config saved to 
/var/cache/conftool/dbconfig/20250203-131542-root.json [13:16:00] ok, found it, must be 7ca645dbb arturo [13:16:12] (03Abandoned) 10Jbond: netbox: update netbox service definition so it pages [puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:16:20] jynus: sending a fix [13:16:26] ok, no prob [13:16:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2209 T385457', diff saved to https://phabricator.wikimedia.org/P73084 and previous config saved to /var/cache/conftool/dbconfig/20250203-131631-marostegui.json [13:16:36] claime: we can deploy again [13:16:54] (03Abandoned) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 10Jbond) [13:17:12] !log cgoubert@deploy2002 Unlocked for deployment [MediaWiki]: Emergency s3 switchover T385457 (duration: 07m 36s) [13:17:21] (03PS1) 10Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) [13:17:22] cool, lock lifted, thanks [13:17:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73085 and previous config saved to /var/cache/conftool/dbconfig/20250203-131732-root.json [13:18:26] (03Abandoned) 10Jbond: R:system::role: colour system role based on its name [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [13:18:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [13:19:13] (03CR) 10Lucas Werkmeister: Enable 
$wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [13:19:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:20:43] arturo: I am merging a few automated phab tickets into 1 [13:21:06] jynus: sure [13:21:07] (03PS1) 10Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) [13:21:25] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [13:21:28] I will use ^ as the canonical one [13:21:37] (03CR) 10Marostegui: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1116790 (https://phabricator.wikimedia.org/T385457) (owner: 10Gerrit maintenance bot) [13:22:01] !log marostegui@dns1006 START - running authdns-update [13:22:20] (03PS2) 10Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) [13:22:27] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [13:23:13] (03PS1) 10Marostegui: db2209: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1116798 (https://phabricator.wikimedia.org/T385457) [13:23:34] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2209.codfw.wmnet [13:23:45] (03PS3) 10Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - 
10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) [13:23:46] (03CR) 10Marostegui: [C:03+2] db2209: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1116798 (https://phabricator.wikimedia.org/T385457) (owner: 10Marostegui) [13:23:52] !log marostegui@dns1006 END - running authdns-update [13:24:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [13:27:08] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]] [13:27:16] T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255 [13:27:16] T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170 [13:27:17] T385269: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269 [13:27:17] T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - https://phabricator.wikimedia.org/T385267 [13:27:43] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2209.codfw.wmnet [13:28:01] 06SRE, 10MW-on-K8s, 06serviceops: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10516923 (10jijiki) p:05Triage→03Medium [13:28:30] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Index 
rebuild [13:29:02] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: node_kernel_messages: fix ignore regex file path [puppet] - 10https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [13:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10516942 (10phaultfinder) [13:30:31] (03Abandoned) 10Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - 10https://gerrit.wikimedia.org/r/841479 (owner: 10Jbond) [13:32:16] (03CR) 10Marostegui: [C:03+1] dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [13:32:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73087 and previous config saved to /var/cache/conftool/dbconfig/20250203-133237-root.json [13:33:06] (03Abandoned) 10Jbond: installserver: add spec test for role [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond) [13:34:08] !log reedy@deploy2002 reedy: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]] synced to the testservers (h [13:34:08] ttps://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:13] T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255 [13:34:14] T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170 [13:34:14] T385269: PHP Deprecated: str_replace(): Passing null to parameter 
#3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269 [13:34:14] T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - https://phabricator.wikimedia.org/T385267 [13:34:17] !log reedy@deploy2002 reedy: Continuing with sync [13:35:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: 10DLynch) [13:40:05] (03PS1) 10Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 [13:41:05] (03PS2) 10Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 [13:43:52] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]] (duration [13:43:52] : 16m 43s) [13:43:58] T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255 [13:43:58] T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170 [13:43:58] T385269: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269 [13:43:59] T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - 
https://phabricator.wikimedia.org/T385267 [13:47:02] (03PS1) 10Arturo Borrero Gonzalez: promethes: node_kernel_messages: fix another typo in source file name [puppet] - 10https://gerrit.wikimedia.org/r/1116802 (https://phabricator.wikimedia.org/T380960) [13:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73088 and previous config saved to /var/cache/conftool/dbconfig/20250203-134742-root.json [13:47:54] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] promethes: node_kernel_messages: fix another typo in source file name [puppet] - 10https://gerrit.wikimedia.org/r/1116802 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [13:50:03] (03CR) 10Gergő Tisza: [C:03+1] Enable $wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [13:51:21] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10517106 (10fgiunchedi) >>! In T384731#10511648, @cmooney wrote: >>>! In T384731#10511163, @fgiunchedi wrote: >> Ye... [13:54:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:50] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10517133 (10fgiunchedi) SGTM too, re: extracting hostname from interface description we could do it via regexp if the extraction/... [13:57:22] (03CR) 10CDanis: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [13:59:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1400). [14:00:04] DreamRimmer and kemayo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:20] o/ [14:00:20] /o [14:00:41] o/ [14:01:24] I can deploy today! [14:02:49] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Change "$wgUploadMissingFileUrl" for svwiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer) [14:03:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer) [14:03:05] let’s start with DreamRimmer then [14:04:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:05:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Change "$wgUploadMissingFileUrl" for svwiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer) 
[14:05:25] (03Merged) 10jenkins-bot: Change "$wgUploadMissingFileUrl" for svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer) [14:05:39] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]] [14:05:42] T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452 [14:11:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:24] T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452 [14:11:28] DreamRimmer: please test :) [14:11:34] doing [14:13:47] looks good [14:13:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync [14:13:53] \o/ [14:14:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:15:41] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10517235 (10Jhancock.wm) [14:17:18] (03CR) 10Bartosz Dziewoński: "Absolutely, feel free to deploy whenever you have time. I tested manually on the beta cluster and the httpbb tests should cover the rest. 
" [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [14:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188 T385084', diff saved to https://phabricator.wikimedia.org/P73091 and previous config saved to /var/cache/conftool/dbconfig/20250203-141939-marostegui.json [14:19:42] T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084 [14:19:46] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1188.eqiad.wmnet [14:20:22] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]] (duration: 14m 42s) [14:20:24] T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452 [14:21:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: 10DLynch) [14:22:12] (03Merged) 10jenkins-bot: Enable VisualEditor EditCheck on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: 10DLynch) [14:22:27] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]] [14:22:30] T385205: [config] Enable Edit Check (References) for all newcomers at de.wiki - https://phabricator.wikimedia.org/T385205 [14:26:03] !log lucaswerkmeister-wmde@deploy2002 kemayo, lucaswerkmeister-wmde: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:26:11] Kemayo: please test :) [14:26:20] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1188.eqiad.wmnet [14:26:25] Lucas_WMDE: It looks good. [14:26:31] ok, thanks! 
[14:26:44] !log lucaswerkmeister-wmde@deploy2002 kemayo, lucaswerkmeister-wmde: Continuing with sync [14:26:49] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Index rebuild [14:30:31] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1487MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [14:33:11] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]] (duration: 10m 43s) [14:33:14] T385205: [config] Enable Edit Check (References) for all newcomers at de.wiki - https://phabricator.wikimedia.org/T385205 [14:34:58] (03PS1) 10Lucas Werkmeister (WMDE): Remove /pt from ptwikibooks $wgUploadMissingFileUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116812 [14:35:17] since we have a bit of time left in the window – anyone want to +1 ^ so I can deploy it? 
:) [14:35:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Change "$wgUploadMissingFileUrl" for svwiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:47] eh, let’s call the window done and see if anyone reviews that config change later [14:42:53] !log UTC afternoon backport+config window done [14:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:22] (03CR) 10Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [14:57:19] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10517441 (10Marostegui) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:56] (03PS1) 10Andrew Bogott: Revert "Revert "Horizon: update release version for codfw1dev"" [puppet] - 10https://gerrit.wikimedia.org/r/1116816 [15:11:29] (03CR) 10Andrew Bogott: [C:03+2] Revert "Revert "Horizon: update release version for codfw1dev"" [puppet] - 10https://gerrit.wikimedia.org/r/1116816 (owner: 10Andrew Bogott) [15:12:01] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485 (10RobH) 03NEW [15:12:14] 10ops-eqiad, 
06Data-Platform-SRE, 06DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10517536 (10RobH) [15:14:02] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10517540 (10RobH) a:03BTullis @BTullis Please review the above task description and checklist to ensure it covers all the hostnames for upgrade to the 8TB HDD and t... [15:20:24] 06SRE, 10Wikimedia-Etherpad, 07SecTeam-Processed, 07Security: Deletion of etherpad - https://phabricator.wikimedia.org/T385356#10517566 (10sbassett) [15:20:29] 06SRE, 10Wikimedia-Etherpad, 07SecTeam-Processed, 07Security: Deletion of etherpad - https://phabricator.wikimedia.org/T385356#10517567 (10sbassett) p:05Triage→03Low [15:21:22] 06SRE, 10Wikimedia-Etherpad, 07SecTeam-Processed, 07Security: Delete etherpad "thispadisnotsecureJustTesting" - https://phabricator.wikimedia.org/T385356#10517572 (10sbassett) [15:23:22] (03PS1) 10Volans: spicerack: extend run_cookbook() accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1116818 [15:30:31] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1472MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [15:32:28] (03PS1) 10CDanis: tunnelencabulator: add reqctl & bump [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269) [15:33:07] (03CR) 10Volans: [C:03+1] "Nice!" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269) (owner: 10CDanis) [15:34:45] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: add reqctl & bump [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269) (owner: 10CDanis) [15:34:56] (03PS2) 10Herron: add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) [15:37:26] (03CR) 10CDanis: [C:03+1] add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:37:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db1169.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73093 and previous config saved to /var/cache/conftool/dbconfig/20250203-153755-fceratto.json [15:37:59] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [15:40:01] (03PS1) 10Arturo Borrero Gonzalez: prometheus-node-kernel-messages.sh: don't fail if there are no matches [puppet] - 10https://gerrit.wikimedia.org/r/1116822 (https://phabricator.wikimedia.org/T380960) [15:40:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: provisioning - T385141 [15:41:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1251.eqiad.wmnet with reason: provisioning - T385141 [15:43:45] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus-node-kernel-messages.sh: don't fail if there are no matches [puppet] - 10https://gerrit.wikimedia.org/r/1116822 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [15:45:22] (03PS2) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) [15:45:48] 
(03CR) 10Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [15:46:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [15:46:37] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [15:48:07] (03CR) 10Herron: [C:03+2] add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:49:38] (03CR) 10Volans: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:52:13] (03PS1) 10Sohom Datta: Fix regression with re-enabling button after error [extensions/PageTriage] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1116824 (https://phabricator.wikimedia.org/T385355) [15:52:20] (03Merged) 10jenkins-bot: add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:53:19] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1116825 [15:53:23] (03PS1) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) [15:54:42] (03CR) 10Volans: "Thanks for the patch, couple of considerations inline." 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1115767 (owner: 10JMeybohm) [15:54:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/PageTriage] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1116824 (https://phabricator.wikimedia.org/T385355) (owner: 10Sohom Datta) [15:55:46] (03PS1) 10Federico Ceratto: instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) [15:56:44] (03CR) 10Fabfur: [C:03+1] varnish: x-analytics: Authorization header summary [puppet] - 10https://gerrit.wikimedia.org/r/1111695 (owner: 10CDanis) [15:56:46] (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [15:58:41] (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [15:59:22] (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:04:07] (03PS2) 10Federico Ceratto: instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) [16:04:20] (03PS1) 10Scott French: php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) [16:04:36] (03CR) 10Marostegui: [C:03+1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 
10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:05:10] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:12:58] (03PS6) 10Jcrespo: dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [16:12:58] (03PS1) 10Jcrespo: dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) [16:21:14] 06SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701#10517780 (10Nemoralis) >>! In T336701#9383499, @TheresNoTime wrote: > Atlassian statuspage has [[ https://support.atlassian.com/statuspage/docs/enable-webhook-notifications/ | webhook support ]].. that... [16:25:37] (03PS7) 10Jcrespo: dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [16:27:00] (03PS3) 10Herron: aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417) [16:28:38] (03CR) 10CDanis: [C:03+1] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1630) [16:34:14] (03PS3) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) [16:34:34] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10517875 (10phaultfinder) [16:34:45] (03PS2) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) [16:34:45] (03PS1) 10Elukey: kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) [16:36:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [16:37:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db1251.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73096 and previous config saved to /var/cache/conftool/dbconfig/20250203-163722-fceratto.json [16:37:25] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [16:37:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73097 and previous config saved to /var/cache/conftool/dbconfig/20250203-163727-root.json [16:37:41] (03PS3) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [16:38:10] (03CR) 10Jgiannelos: [C:04-1] kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:38:43] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:39:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:10] (03CR) 10Elukey: kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:41:11] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:57] (03CR) 10Jgiannelos: [C:04-1] kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:42:08] (03CR) 10Kamila Součková: [C:03+1] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French) [16:42:40] (03PS1) 10Daimona Eaytoy: core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) [16:43:10] (03PS2) 10Elukey: kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) [16:43:10] (03PS3) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) [16:43:28] (03CR) 10Elukey: kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:43:50] (03PS2) 10Daimona Eaytoy: core-Permissions: drop redundant 
CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) [16:44:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1169.eqiad.wmnet onto db1251.eqiad.wmnet [16:44:36] (03CR) 10Scott French: [C:03+1] deployment_server: Don't choke on 'Extension:scriptname' in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1116535 (https://phabricator.wikimedia.org/T380533) (owner: 10RLazarus) [16:45:33] jouncebot: nowandnext [16:45:33] For the next 0 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1630) [16:45:33] In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800) [16:45:33] In 1 hour(s) and 14 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800) [16:46:03] (03CR) 10Jgiannelos: [C:03+1] kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:46:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:46:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:46:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:46:56] (03CR) 10Elukey: [C:03+2] kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73099 and previous config saved to /var/cache/conftool/dbconfig/20250203-165232-root.json [16:54:28] (03PS4) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) [16:54:38] (03PS1) 10Effie Mouzeli: shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038) [16:54:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10517991 (10Papaul) @VRiley-WMF it looks like all those servers are connected to 1G. Can you please move them to 10G ports and update the task i can h... 
[16:56:17] (03PS1) 10Effie Mouzeli: shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038) [16:57:14] (03CR) 10Alexandros Kosiaris: [C:03+1] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French) [16:57:19] (03PS4) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [16:57:19] (03PS3) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) [16:57:45] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:58:22] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:01:46] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10518010 (10aborrero) >>! In T382412#10512402, @cmooney wrote: > I'm guessing you're gonna migrate by removing on... 
[17:01:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [17:03:35] (03PS5) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [17:03:35] (03PS4) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) [17:07:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73100 and previous config saved to /var/cache/conftool/dbconfig/20250203-170737-root.json [17:13:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73101 and previous config saved to /var/cache/conftool/dbconfig/20250203-171322-marostegui.json [17:13:26] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:20:31] (03CR) 10Gergő Tisza: [C:03+1] Enable $wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [17:22:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73102 and previous config saved to /var/cache/conftool/dbconfig/20250203-172243-root.json [17:23:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10518139 (10Jhancock.wm) i got some instructions from dell. kind of similar to what we tried with some extra cables to reset. Is this server still depooled? 
[17:24:45] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:24:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:24:45] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:24:45] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:24:45] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:24:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10518143 (10MatthewVernon) Yeah, you can work on this server any time, but thanks for checking :) [17:25:45] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:45] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:42] (03PS1) 10CDanis: chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) [17:27:43] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:27:43] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:27:45] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:28:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P73103 and previous config saved to /var/cache/conftool/dbconfig/20250203-172829-marostegui.json [17:32:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:32:38] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, and 2 others: decommission db2139 - https://phabricator.wikimedia.org/T383971#10518217 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:32:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:58] (03CR) 10Aude: [C:03+1] chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [17:37:04] (03CR) 10CDanis: [C:03+2] chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [17:37:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73104 and previous config saved to /var/cache/conftool/dbconfig/20250203-173748-root.json [17:38:20] (03CR) 10Jcrespo: [C:03+2] dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [17:38:21] (03Merged) 10jenkins-bot: chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [17:39:16] !log cdanis@deploy2002 helmfile 
[staging] START helmfile.d/services/chart-renderer: apply [17:39:52] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:40:51] (03PS2) 10Jcrespo: dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) [17:40:51] (03PS1) 10Jcrespo: dbbackups: Update grants for misc hosts other than m1 [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) [17:40:53] (03PS1) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) [17:42:58] (03CR) 10Jcrespo: "This is technically a production change, just happens to be part of the backup app." [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [17:43:25] (03CR) 10Jcrespo: "This is pending to be deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [17:43:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P73105 and previous config saved to /var/cache/conftool/dbconfig/20250203-174336-marostegui.json [17:44:09] (03CR) 10Jcrespo: [C:03+2] dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) (owner: 10Jcrespo) [17:44:29] (03CR) 10Jcrespo: [C:03+2] "Merging this as this was deployed at the same time than the previous backup source change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) (owner: 10Jcrespo) [17:46:10] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [17:46:42] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [17:46:45] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [17:47:12] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [17:57:38] (03PS5) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) [17:58:31] !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'JOestby' 'Johannesoestby' # T385503 [17:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:34] T385503: Unblock stuck global renames - https://phabricator.wikimedia.org/T385503 [17:58:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73106 and previous config saved to /var/cache/conftool/dbconfig/20250203-175843-marostegui.json [17:58:46] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:58:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance [17:59:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73107 and previous config saved to /var/cache/conftool/dbconfig/20250203-175904-marostegui.json [17:59:18] [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'Tarasssst' 'TR101' # T385503 [17:59:24] (03CR) 
10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [18:00:04] swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800). [18:00:04] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800). Please do the needful. [18:00:20] o/ [18:01:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115966 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:01:31] !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'Tarasssst' 'TR101' # T385503 [18:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:41] that's why the log entry didn't make it to the task... 
missing !_log [18:02:12] (03Merged) 10jenkins-bot: Enroll 10% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115966 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:02:31] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]] [18:02:34] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:06:19] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:07:18] !log swfrench@deploy2002 swfrench: Continuing with sync [18:13:45] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 13s) [18:13:48] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:19:17] (03CR) 10Scott French: [C:03+2] mw-api-int: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115972 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:20:26] (03Merged) 10jenkins-bot: mw-api-int: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115972 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:25:28] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:27:06] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:27:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:27:49] (03CR) 10Herron: [C:03+2] aux_k8s: apply 
etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [18:28:51] !log mw-api-int to ~ 1% of traffic on PHP 8.1 in eqiad - T383845 [18:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:54] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:29:39] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [18:29:40] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [18:32:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:55] FIRING: [3x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518605 (10phaultfinder) [18:39:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:41:29] !log mw-api-int to ~ 1% of traffic on PHP 8.1 in codfw - T383845 [18:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:31] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:42:55] FIRING: [4x] SystemdUnitFailed: etcd.service on 
aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:19] ^ expected, being turned up by herron [18:45:56] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:46:11] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:47:55] FIRING: [5x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:10] FIRING: [5x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:50] I'll enter some silences [18:50:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1169.eqiad.wmnet onto db1251.eqiad.wmnet [18:50:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:51:07] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:52:55] FIRING: [4x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:55] (03PS1) 10Urbanecm: [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) [18:55:55] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:00:09] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002" [19:00:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002" [19:00:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:02:21] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2003 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:06:23] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2004 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:10:25] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2005 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:11:30] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10518711 (10herron) >>! In T381417#10518550, @gerritbot wrote: > Change #1116825 **merged** by Herron: > %%%[operations/puppet@production] aux_k8s... 
[19:17:37] jouncebot: nowandnext [19:17:37] No deployments scheduled for the next 1 hour(s) and 42 minute(s) [19:17:37] In 1 hour(s) and 42 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2100) [19:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518739 (10phaultfinder) [19:19:57] RECOVERY - Host analytics1073 is UP: PING WARNING - Packet loss = 33%, RTA = 0.93 ms [19:21:15] unless there are any objections, I'd like to use this quiet spot to deploy a fix for T385225, which will require a scap deployment in order to pick up a new base image [19:21:16] T385225: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225 [19:26:21] PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:45] (03PS1) 10Dwisehaupt: Another CNAME for acoustic landing pages [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) [19:27:19] (03CR) 10Dwisehaupt: "One last one needed before acoustic can shift the site over to the new cert." [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [19:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518765 (10phaultfinder) [19:31:01] (03CR) 10Dwisehaupt: [C:04-2] "Acoustic provided the wrong info. Updating with the correct info in a sec." [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [19:31:38] swfrench-wmf: no rush but whenever you're done with that I'll roll out an apache config change [19:31:52] (if there's time before the 21:00 backport window, and if not I can just do it later) [19:32:20] rzl: ack, thanks! feel free to go ahead, actually - I appear to have stepped on a reprepro rake :) [19:32:45] sure, going! 
good luck with your rake [19:32:58] (03CR) 10RLazarus: [V:03+1 C:03+2] Use new 'auth' docroot for the auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [19:37:56] works on metal mwdebug, scapping [19:38:36] rzl: I think I've got my end sorted, so you'll see me doing a bit of work in the background, but nothing that should actually change production until you're done :) [19:38:57] 👍 [19:39:11] thanks for the heads up! I'll let you know when I'm hands-off [19:40:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:31] !log ran reprepro include mercurius 1.1.0-1 - T385225 [19:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:34] T385225: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225 [19:40:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:43:23] oh, thanks for deploying that :) [19:45:59] sure thing :) thanks for your patience [19:46:50] (03CR) 10Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [19:46:58] !log rzl@deploy2002 Started scap sync-world: T383952, T384137 [19:47:02] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [19:47:02] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [19:50:16] MatmaRex: hmm, deployed to the testservers but the new httpbb tests for robots.txt and favicon.ico are 
failing [19:50:23] https://www.irccloud.com/pastebin/LI4ZYclY/ [19:50:37] I'm taking a look but since you're around, in case you want to dig :) [19:50:52] rzl: could be cached [19:51:08] on the appserver? [19:51:30] oh [19:52:27] yeah, that's not right [19:53:01] (03PS2) 10Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) [19:53:01] (03PS1) 10Lucas Werkmeister: DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) [19:53:21] it looks like instead of serving the files directly, it's being rewritten to… static.php, probably? [19:53:29] (03CR) 10Lucas Werkmeister: [C:04-1] "do not merge yet, only uploading this because I already wanted to write the comment down" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [19:53:50] (03CR) 10CI reject: [V:04-1] DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [19:54:36] (03PS2) 10Lucas Werkmeister: DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) [19:54:49] the tests are passing on mwdebug2001.codfw.wmnet, but not on mwdebug.discovery.wmnet:4444, so normally I would think we didn't bring the config change into the k8s side correctly [19:54:59] except that the diffs did look correct [19:57:01] e.g. 
(only including mw-web-main, the others all looked the same) https://www.irccloud.com/pastebin/mA7qkGKv/ [19:57:31] cc swfrench-wmf in case you can spot something I missed [19:57:40] looking [19:58:14] all the extra +ServerName -ServerNames do cancel out correctly, one of those cases where the minimal lines diff isn't the same as the minimal semantic one [20:00:55] also curious that it's a 500 for mwdebug but a 503 for mwdebug-next [20:00:57] rzl: it seems to me that there should be a diff for the lines controlled by "public_rewrites: false", almost at the very bottom, but there isn't [20:01:08] (which I confirm from the browser) [20:03:29] the behavior sure looks like "public_rewrites: false" is not having any effect. did i put it in the wrong place or something? [20:03:30] MatmaRex: hmm, true [20:03:57] it did have the correct effect on the bare-metal hosts but not in the k8s version, I wonder if we do the templating differently there [20:04:22] okay, I'm inclined to revert and try again another time -- any data you want to collect first? [20:04:48] not really [20:04:51] (this isn't causing any harm and we can keep looking, I just want to relinquish the conch and unblock swfrench-wmf's other thing) [20:05:34] ^ also neat, I exited scap and I guess that killed logmsgbot [20:06:25] meanwhile, good job writing the tests :) [20:06:34] (03PS1) 10RLazarus: Revert "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1116862 [20:06:56] if you all need more time for debugging, that's totally fine - my change shouldn't take _too_ long to get out (aside from surprises that result in slow image builds) [20:07:37] nah I think we have what we need -- we might be able to make another try after you're done, depending [20:07:47] rzl: heh, yeah, thanks for making me add them ;) [20:07:51] just awaiting slowkins [20:08:06] do you have any idea why this would work differently under kubernetes? 
(i don't) [20:08:44] specifically no, but we reproduced a lot of the templating logic and I wouldn't be stunned if they inadvertently handle that variable differently -- I can start digging once this is rolled back [20:08:47] (03CR) 10CI reject: [V:04-1] Revert "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1116862 (owner: 10RLazarus) [20:08:48] FIRING: [2x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:09:26] ughhhhh yes okay I didn't word-wrap my own revert message in the gerrit box [20:09:33] you're so right to save production from my recklessness [20:09:45] we can't have that! [20:10:20] (03PS2) 10RLazarus: Revert "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1116862 [20:10:46] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [20:12:40] (03CR) 10RLazarus: [C:03+2] Revert "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1116862 (owner: 10RLazarus) [20:13:48] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:21:52] rolled back successfully on mwdebug, running puppet on deploy2002 now and then scapping [20:23:46] rzl: btw, i wonder, is there a kubernetes-based environment on the beta cluster? i am wondering if i could have caught this problem before the production deployment [20:25:29] I don't know :) [20:26:47] (03CR) 10Scott French: "Thank you both for the reviews!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French) [20:27:04] (03CR) 10Scott French: [V:03+2] "Verified to build locally." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French) [20:27:30] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French) [20:27:55] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518948 (10phaultfinder) [20:29:46] just for completeness, since I excerpted the original diffs, here's the full output from the rollback, since there's only a diff against mw-debug and friends -- just note the +s and -s are reversed since it's a revert https://www.irccloud.com/pastebin/jiq3q4CW/ [20:29:56] !log rzl@deploy2002 Started scap sync-world: T383952, T384137 [20:30:00] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [20:30:00] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [20:30:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:32] ^ expected from httpbb version skew with this puppet change, will self-resolve [20:31:53] !log rzl@deploy2002 rzl: T383952, T384137 synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:32:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:32:35] !log rzl@deploy2002 rzl: Continuing with sync [20:32:49] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:11] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:35] !log rzl@deploy2002 Finished scap sync-world: T383952, T384137 (duration: 06m 10s) [20:34:01] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:34:02] swfrench-wmf: all yours [20:34:12] rzl: ack, thank you! 
[20:34:32] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:34:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:34:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:35:39] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:35:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:37:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:39:08] (03PS1) 10Herron: wmnet: add codfw aux-k8s-etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) [20:48:58] 06SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701#10519011 (10Nemoralis) It looks like there is 
https://fox.nexus/@wikistatus run by @TheresNoTime [20:51:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: 10Urbanecm) [20:54:50] (03PS1) 10Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) [20:55:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [20:59:45] (03PS2) 10Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) [20:59:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2100) [21:00:05] Sohom_Datta and urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] i can deploy today [21:00:32] swfrench-wmf: rzl: i noticed you did something mw related recently, all done? 
[21:01:25] urbanecm: so, work on my end is a bit stuck at the moment, but at the specific point it's stuck, you should be able to proceed with your backport [21:01:38] ack, ty [21:01:51] if you could check in with me before you proceed to the 2nd one, that would be greatly appreciated [21:01:59] sure [21:02:04] i don't see Sohom_Datta in here [21:02:07] so i'll just do the config now [21:02:34] sounds good [21:02:34] (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: 10Urbanecm) [21:03:19] (03Merged) 10jenkins-bot: [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: 10Urbanecm) [21:04:21] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly locally on git master." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104740 (owner: 10Pppery) [21:04:34] (03PS3) 10Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) [21:04:37] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [21:04:49] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]] [21:04:52] T384505: Increase the number of new accounts getting a mentor at English Wikipedia - https://phabricator.wikimedia.org/T384505 [21:08:33] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:40] !log urbanecm@deploy2002 
urbanecm: Continuing with sync [21:09:38] (03PS4) 10Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) [21:10:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [21:14:08] (03PS5) 10Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) [21:14:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [21:15:11] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]] (duration: 10m 22s) [21:15:14] T384505: Increase the number of new accounts getting a mentor at English Wikipedia - https://phabricator.wikimedia.org/T384505 [21:15:41] still no sign of Sohom_Datta [21:15:53] swfrench-wmf: done [21:16:04] urbanecm: ack, thanks! [21:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10519118 (10phaultfinder) [21:27:05] (03CR) 10Andrew Bogott: [C:03+2] "self-merging because I have users who are struggling with this right now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: 10Andrew Bogott) [21:30:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10519135 (10phaultfinder) [21:32:08] RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:32:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:37:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:48:44] (03PS2) 10Dwisehaupt: Shift CNAME 
for acoustic landing pages [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) [21:49:14] (03CR) 10Dwisehaupt: Shift CNAME for acoustic landing pages [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [22:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2200). [22:09:57] (03CR) 10BCornwall: [C:03+1] Shift CNAME for acoustic landing pages [dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [22:11:04] (03CR) 10BCornwall: [C:03+1] wmnet: add codfw aux-k8s-etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [22:13:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10519204 (10VRiley-WMF) [22:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10519206 (10VRiley-WMF) a:03Jgreen Hey Jeff I was able to connect this unit up with 10G. It should now be ready to go for you. Let me know if you need anyt... 
[22:39:59] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [22:43:57] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1001 - vriley@cumin1002" [22:44:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1001 - vriley@cumin1002" [22:44:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:52:26] (03PS4) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [22:52:48] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [22:54:22] (03PS5) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [22:55:52] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [22:55:53] (03Abandoned) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 (owner: 10Jdlrobson) [23:01:24] (03PS6) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [23:01:50] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [23:01:53] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [23:02:21] (03PS7) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [23:03:51] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [23:08:09] (03CR) 10Dwisehaupt: [C:03+2] "Verbal confirmation from jgreen also." 
[dns] - 10https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [23:08:19] !log dwisehaupt@dns1004 START - running authdns-update [23:09:53] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1002 - vriley@cumin1002" [23:09:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1002 - vriley@cumin1002" [23:09:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:10:14] !log dwisehaupt@dns1004 END - running authdns-update [23:14:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73108 and previous config saved to /var/cache/conftool/dbconfig/20250203-231428-marostegui.json [23:14:31] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:29:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P73109 and previous config saved to /var/cache/conftool/dbconfig/20250203-232933-marostegui.json [23:32:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10519367 (10VRiley-WMF) a:03Jgreen [23:33:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10519369 (10VRiley-WMF) All these should be set and ready to go! @Jgreen Let me know if you need anything else! 
[23:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P73110 and previous config saved to /var/cache/conftool/dbconfig/20250203-234440-marostegui.json [23:59:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73111 and previous config saved to /var/cache/conftool/dbconfig/20250203-235947-marostegui.json [23:59:51] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592