[00:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:54:49] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1347 MB (1% inode=98%): /tmp 1347 MB (1% inode=98%): /var/tmp 1347 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:08:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:09:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T384592)', diff saved to https://phabricator.wikimedia.org/P73047 and previous config saved to /var/cache/conftool/dbconfig/20250203-020900-marostegui.json
[02:09:04] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
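Each `dbctl commit` line records where the previous configuration was saved. Judging only from the paths in this log, snapshot names appear to follow a `YYYYMMDD-HHMMSS-<user>.json` pattern, where the user is `marostegui` on some commits and `root` on others (for example the staged repools). The sketch below parses that assumed pattern so a snapshot can be matched back to its log line. `parse_snapshot` and the regex are hypothetical helpers, not part of conftool.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed naming pattern, inferred only from the paths in this log:
#   /var/cache/conftool/dbconfig/YYYYMMDD-HHMMSS-<user>.json
SNAPSHOT_RE = re.compile(r"^(\d{8}-\d{6})-([A-Za-z0-9_.-]+)\.json$")


def parse_snapshot(path: str):
    """Return (timestamp, user) for a dbctl config snapshot path, or None if it doesn't match."""
    name = PurePosixPath(path).name  # drop the directory part
    m = SNAPSHOT_RE.match(name)
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y%m%d-%H%M%S")
    return stamp, m.group(2)


if __name__ == "__main__":
    print(parse_snapshot("/var/cache/conftool/dbconfig/20250203-020900-marostegui.json"))
```

In this log the snapshot stamp is either the same second as the log line or one second earlier (020900 for the 02:09:01 entry, 022407 for 02:24:08), so matching should allow a small tolerance.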
[02:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:24:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P73048 and previous config saved to /var/cache/conftool/dbconfig/20250203-022407-marostegui.json
[02:34:31] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P73049 and previous config saved to /var/cache/conftool/dbconfig/20250203-023914-marostegui.json
[02:54:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T384592)', diff saved to https://phabricator.wikimedia.org/P73050 and previous config saved to /var/cache/conftool/dbconfig/20250203-025421-marostegui.json
[02:54:24] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:54:37] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[02:54:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73051 and previous config saved to /var/cache/conftool/dbconfig/20250203-025443-marostegui.json
[03:07:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:29] <jinxer-wm>	 FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[06:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:29] <jinxer-wm>	 FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[07:49:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:12:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:30:11] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:31:24] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply
[08:32:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply
[08:34:35] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[08:35:36] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[08:36:37] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[08:37:03] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[08:37:15] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[08:37:35] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[09:06:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1182 T385084', diff saved to https://phabricator.wikimedia.org/P73052 and previous config saved to /var/cache/conftool/dbconfig/20250203-090558-marostegui.json
[09:06:02] <stashbot>	 T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084
[09:06:19] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1182.eqiad.wmnet
[09:07:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild + upgrade
[09:12:56] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1182.eqiad.wmnet
[09:13:33] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Index rebuild
[09:14:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2031 to es2 codfw master dbtmaint T376905', diff saved to https://phabricator.wikimedia.org/P73053 and previous config saved to /var/cache/conftool/dbconfig/20250203-091450-root.json
[09:17:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73054 and previous config saved to /var/cache/conftool/dbconfig/20250203-091700-marostegui.json
[09:17:03] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[09:17:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:56] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[09:22:34] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Index rebuild
[09:22:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es2026.codfw.wmnet with reason: Kernel reboot
[09:26:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2026 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73055 and previous config saved to /var/cache/conftool/dbconfig/20250203-092559-marostegui.json
[09:26:27] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2026.codfw.wmnet
[09:29:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es2037.codfw.wmnet with reason: Kernel reboot
[09:29:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2037 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73057 and previous config saved to /var/cache/conftool/dbconfig/20250203-092928-marostegui.json
[09:29:40] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2037.codfw.wmnet
[09:32:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P73058 and previous config saved to /var/cache/conftool/dbconfig/20250203-093207-marostegui.json
[09:35:15] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2037.codfw.wmnet
[09:36:09] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2026.codfw.wmnet
[09:36:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73059 and previous config saved to /var/cache/conftool/dbconfig/20250203-093613-root.json
[09:36:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73060 and previous config saved to /var/cache/conftool/dbconfig/20250203-093628-root.json
[09:40:31] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 10409MiB (2% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[09:47:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P73061 and previous config saved to /var/cache/conftool/dbconfig/20250203-094714-marostegui.json
[09:51:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73062 and previous config saved to /var/cache/conftool/dbconfig/20250203-095118-root.json
[09:51:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73063 and previous config saved to /var/cache/conftool/dbconfig/20250203-095133-root.json
[10:00:31] <icinga-wm>	 RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:02:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T384592)', diff saved to https://phabricator.wikimedia.org/P73064 and previous config saved to /var/cache/conftool/dbconfig/20250203-100221-marostegui.json
[10:02:24] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:02:37] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[10:02:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:03:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73065 and previous config saved to /var/cache/conftool/dbconfig/20250203-100300-marostegui.json
[10:06:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73066 and previous config saved to /var/cache/conftool/dbconfig/20250203-100623-root.json
[10:06:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73067 and previous config saved to /var/cache/conftool/dbconfig/20250203-100638-root.json
[10:21:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73068 and previous config saved to /var/cache/conftool/dbconfig/20250203-102129-root.json
[10:21:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73069 and previous config saved to /var/cache/conftool/dbconfig/20250203-102144-root.json
[10:27:08] <jinxer-wm>	 FIRING: ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase2035-b:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:53] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:31:53] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:36:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73070 and previous config saved to /var/cache/conftool/dbconfig/20250203-103634-root.json
[10:36:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73071 and previous config saved to /var/cache/conftool/dbconfig/20250203-103649-root.json
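The es2037 and es2026 repools above step through 10%, 25%, 50%, 75% and 100% at roughly 15-minute intervals (09:36 to 10:36). Here is a minimal sketch of that schedule, assuming a fixed 15-minute gap. The step list is read off this log, and `repool_schedule` is a hypothetical helper, not the dbctl or cookbook API.

```python
from datetime import datetime, timedelta

# Pool percentages observed in this log for the es2037/es2026/es2027/db1182 repools.
REPOOL_STEPS = (10, 25, 50, 75, 100)


def repool_schedule(start, interval=timedelta(minutes=15), steps=REPOOL_STEPS):
    """Return (time, percent) pairs for a gradual repool beginning at `start`.

    Assumes a constant `interval` between steps (an assumption, not the tool's logic).
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    for when, pct in repool_schedule(datetime(2025, 2, 3, 9, 36, 13)):
        print(f"{when:%H:%M:%S}  {pct:>3}%")
```

The real es2037 commits (09:51:18, 10:06:23, 10:21:29, 10:36:34) drift by about five seconds per step. This suggests the wait is counted from the end of each commit rather than from a fixed start time.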
[10:39:32] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service restbase2035-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:50:31] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 3093MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:59:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2034 to es3 codfw master dbtmaint T376905', diff saved to https://phabricator.wikimedia.org/P73072 and previous config saved to /var/cache/conftool/dbconfig/20250203-105915-root.json
[10:59:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2027 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73073 and previous config saved to /var/cache/conftool/dbconfig/20250203-105935-marostegui.json
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1100)
[11:00:35] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2027.codfw.wmnet
[11:02:17] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[11:04:08] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[11:10:33] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2027.codfw.wmnet
[11:10:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73074 and previous config saved to /var/cache/conftool/dbconfig/20250203-111052-root.json
[11:23:53] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1155.eqiad.wmnet with reason: Kernel reboot
[11:24:05] <marostegui>	 !log Reboot and upgrade db1155
[11:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73075 and previous config saved to /var/cache/conftool/dbconfig/20250203-112558-root.json
[11:27:11] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:21] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:21] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:25] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:34] <marostegui>	 ^ downtiming
[11:27:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1018.eqiad.wmnet with reason: Kernel reboot
[11:28:18] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1014.eqiad.wmnet with reason: Kernel reboot
[11:35:11] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:21] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:21] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:25] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
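The replica IO errors above report `retry-time: 60 maximum-retries: 100000`. Assume `retry-time` is the number of seconds between reconnect attempts and `maximum-retries` is the attempt limit. Under that reading, the clouddb replicas would keep retrying the rebooting db1155 for far longer than any reboot takes:

```latex
t_{\max} = 60\,\mathrm{s} \times 100\,000 = 6 \times 10^{6}\,\mathrm{s} \approx 69.4\ \text{days}
```

Each alert cleared exactly eight minutes after it fired (11:27:11 to 11:35:11 for s7 on clouddb1018). That is within about eight retry intervals, measured at Icinga check granularity. Replication resumed on its own once db1155 was back.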
[11:41:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73076 and previous config saved to /var/cache/conftool/dbconfig/20250203-114103-root.json
[11:56:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[11:56:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73077 and previous config saved to /var/cache/conftool/dbconfig/20250203-115608-root.json
[11:57:33] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] jobqueue: bump ThumbnailRender concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1115899 (https://phabricator.wikimedia.org/T385273) (owner: Hnowlan)
[11:59:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:10:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1115791 (https://phabricator.wikimedia.org/T378527) (owner: Urbanecm)
[12:11:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73078 and previous config saved to /var/cache/conftool/dbconfig/20250203-121113-root.json
[12:11:46] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Various fixes for mwcron [deployment-charts] - https://gerrit.wikimedia.org/r/1115453 (owner: Clément Goubert)
[12:12:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime)
[12:13:51] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Various fixes for mwcron [deployment-charts] - https://gerrit.wikimedia.org/r/1115453 (owner: Clément Goubert)
[12:17:25] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1116778
[12:19:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:19:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:19:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:22:29] <wikibugs>	 (PS2) Clément Goubert: mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781
[12:24:01] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781 (owner: Clément Goubert)
[12:25:02] <wikibugs>	 (Merged) jenkins-bot: mw-cron: Fix test job name [deployment-charts] - https://gerrit.wikimedia.org/r/1116781 (owner: Clément Goubert)
[12:25:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:25:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:28:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10516521 (ayounsi) Looks all good to me !  First start with non-paging, and revisit later on.  I'm wondering if we could re-wri...
[12:39:28] <wikibugs>	 (PS1) Clément Goubert: mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T341555)
[12:39:57] <wikibugs>	 (PS2) Clément Goubert: mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T377962)
[12:41:36] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw_releases: Add mw-cron [puppet] - https://gerrit.wikimedia.org/r/1116782 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert)
[12:42:34] <wikibugs>	 SRE, Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10516577 (Ahonc)
[12:46:40] <Reedy>	 jouncebot: nowandnext
[12:46:41] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 13 minute(s)
[12:46:41] <jouncebot>	 In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1400)
[12:47:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73079 and previous config saved to /var/cache/conftool/dbconfig/20250203-124721-root.json
[12:47:26] <wikibugs>	 (PS1) Reedy: Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255)
[12:47:31] <wikibugs>	 (CR) Reedy: [C:+2] Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255) (owner: Reedy)
[12:47:59] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:48:37] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:50:51] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.010 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:10] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Testing scap deployment of mw-cron
[12:51:27] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:39] <claime>	 Reedy: a minute, doing a thing 
[12:52:02] <Reedy>	 claime: CI will take a while anyway. No rush on my part to actually pull/deploy )
[12:52:04] <Reedy>	 :)
[12:52:09] <claime>	 cool :)
[12:52:36] <Reedy>	 just lining up some stuff to reduce the php8.1 logspam further
[12:53:15] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: Testing scap deployment of mw-cron (duration: 02m 46s)
[12:53:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:53:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:55:26] <wikibugs>	 (PS1) Reedy: SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170)
[12:55:31] <wikibugs>	 (CR) Reedy: [C:+2] SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170) (owner: Reedy)
[12:55:46] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Rebuild image and release file for mw-cron
[12:55:56] <wikibugs>	 (PS1) Reedy: SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269)
[12:56:01] <wikibugs>	 (CR) Reedy: [C:+2] SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269) (owner: Reedy)
[12:56:34] <wikibugs>	 (Merged) jenkins-bot: Add missing array_values for PHP 7 compatibility [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116783 (https://phabricator.wikimedia.org/T385255) (owner: Reedy)
[12:59:40] <wikibugs>	 (PS1) Reedy: ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267)
[12:59:46] <wikibugs>	 (CR) Reedy: [C:+2] ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267) (owner: Reedy)
[13:01:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3
[13:01:24] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1116789 (https://phabricator.wikimedia.org/T385457)
[13:01:29] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1116790 (https://phabricator.wikimedia.org/T385457)
[13:02:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73080 and previous config saved to /var/cache/conftool/dbconfig/20250203-130226-root.json
[13:02:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T385457', diff saved to https://phabricator.wikimedia.org/P73081 and previous config saved to /var/cache/conftool/dbconfig/20250203-130248-root.json
[13:02:51] <stashbot>	 T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457
[13:03:57] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2205 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1116789 (https://phabricator.wikimedia.org/T385457) (owner: Gerrit maintenance bot)
[13:06:31] <marostegui>	 !log Emergency s3 switchover T385457
[13:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:04] <logmsgbot>	 !log cgoubert@deploy2002 Stopping before sync operations
[13:07:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[13:07:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[13:07:18] <jynus>	 what's the puppet failure about?
[13:07:28] <wikibugs>	 (Merged) jenkins-bot: SpecialMathWikibase: Null-coalescence getDescription() call [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116784 (https://phabricator.wikimedia.org/T385170) (owner: Reedy)
[13:08:27] <claime>	 Reedy: all yours
[13:08:42] <marostegui>	 Please do not deploy things now
[13:08:42] <wikibugs>	 (Merged) jenkins-bot: SpecialMathWikibase: Null-coalescence $par [extensions/Math] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116785 (https://phabricator.wikimedia.org/T385269) (owner: Reedy)
[13:08:49] <claime>	 marostegui: ack
[13:08:59] <claime>	 want me to put a scap lock?
[13:09:07] <marostegui>	 yes please
[13:09:12] <claime>	 ack
[13:09:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:09:36] <logmsgbot>	 !log cgoubert@deploy2002 Locking from deployment [MediaWiki]: Emergency s3 switchover T385457
[13:09:38] <stashbot>	 T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457
[13:10:43] <jynus>	 So a change caused wmcs and insetup puppet failures around 11:52
[13:11:32] <wikibugs>	 (Merged) jenkins-bot: ApiQueryContentTranslationSuggestions: Set default value for to and from parameters [extensions/ContentTranslation] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116788 (https://phabricator.wikimedia.org/T385267) (owner: Reedy)
[13:12:24] <jynus>	 but I don't see any relevant change
[13:14:36] <logmsgbot>	 !log jebe@deploy2002 Started deploy [airflow-dags/analytics_product@ce1f0f6]: (no justification provided)
[13:14:41] <jynus>	 Error: /Stage[main]/Prometheus::Node_kernel_messages/File[/etc/prometheus-node-kernel-messages-ignore-regex.txt]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/prometheus/prometheus-node-kernel-messages-ignore-regex.txt
[13:14:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 codfw as read-only for maintenance - T385457', diff saved to https://phabricator.wikimedia.org/P73082 and previous config saved to /var/cache/conftool/dbconfig/20250203-131452-root.json
[13:14:55] <stashbot>	 T385457: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T385457
[13:15:10] <logmsgbot>	 !log jebe@deploy2002 Finished deploy [airflow-dags/analytics_product@ce1f0f6]: (no justification provided) (duration: 00m 36s)
[13:15:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary and set section read-write T385457', diff saved to https://phabricator.wikimedia.org/P73083 and previous config saved to /var/cache/conftool/dbconfig/20250203-131542-root.json
[13:16:00] <jynus>	 ok, found it, must be 7ca645dbb arturo
[13:16:12] <wikibugs>	 (Abandoned) Jbond: netbox: update netbox service definition so it pages [puppet] - https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[13:16:20] <arturo>	 jynus: sending a fix
[13:16:26] <jynus>	 ok, no prob
[13:16:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2209 T385457', diff saved to https://phabricator.wikimedia.org/P73084 and previous config saved to /var/cache/conftool/dbconfig/20250203-131631-marostegui.json
[13:16:36] <marostegui>	 claime: we can deploy again
[13:16:54] <wikibugs>	 (Abandoned) Jbond: POC: P:thanos::swift::frontend:  move ring manager config to hiera [puppet] - https://gerrit.wikimedia.org/r/773794 (owner: Jbond)
[13:17:12] <logmsgbot>	 !log cgoubert@deploy2002 Unlocked for deployment [MediaWiki]: Emergency s3 switchover T385457 (duration: 07m 36s)
[13:17:21] <wikibugs>	 (PS1) Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944)
[13:17:22] <claime>	 cool, lock lifted, thanks
[13:17:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73085 and previous config saved to /var/cache/conftool/dbconfig/20250203-131732-root.json
[13:18:26] <wikibugs>	 (Abandoned) Jbond: R:system::role: colour system role based on its name [puppet] - https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: Jbond)
[13:18:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[13:19:13] <wikibugs>	 (CR) Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[13:19:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:20:43] <jynus>	 arturo: I am merging a few automated phab tickets into 1
[13:21:06] <arturo>	 jynus: sure
[13:21:07] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960)
[13:21:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: Arturo Borrero Gonzalez)
[13:21:28] <jynus>	 I will use ^ as the canonical one
[13:21:37] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1116790 (https://phabricator.wikimedia.org/T385457) (owner: Gerrit maintenance bot)
[13:22:01] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[13:22:20] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960)
[13:22:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: Arturo Borrero Gonzalez)
[13:23:13] <wikibugs>	 (PS1) Marostegui: db2209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1116798 (https://phabricator.wikimedia.org/T385457)
[13:23:34] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2209.codfw.wmnet
[13:23:45] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: prometheus: node_kernel_messages: fix ignore regex file path [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960)
[13:23:46] <wikibugs>	 (CR) Marostegui: [C:+2] db2209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1116798 (https://phabricator.wikimedia.org/T385457) (owner: Marostegui)
[13:23:52] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[13:24:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: Arturo Borrero Gonzalez)
[13:27:08] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]]
[13:27:16] <stashbot>	 T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255
[13:27:16] <stashbot>	 T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170
[13:27:17] <stashbot>	 T385269: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269
[13:27:17] <stashbot>	 T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - https://phabricator.wikimedia.org/T385267
[13:27:43] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2209.codfw.wmnet
[13:28:01] <wikibugs>	 SRE, MW-on-K8s, serviceops: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10516923 (jijiki) p:Triage→Medium
[13:28:30] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Index rebuild
[13:29:02] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] prometheus: node_kernel_messages: fix ignore regex file path [puppet] - https://gerrit.wikimedia.org/r/1116797 (https://phabricator.wikimedia.org/T380960) (owner: Arturo Borrero Gonzalez)
[13:29:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10516942 (phaultfinder)
[13:30:31] <wikibugs>	 (Abandoned) Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - https://gerrit.wikimedia.org/r/841479 (owner: Jbond)
[13:32:16] <wikibugs>	 (CR) Marostegui: [C:+1] dbbackups: Fix dump grants for backup sources and m1 [puppet] - https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: Jcrespo)
[13:32:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73087 and previous config saved to /var/cache/conftool/dbconfig/20250203-133237-root.json
[13:33:06] <wikibugs>	 (Abandoned) Jbond: installserver: add spec test for role [puppet] - https://gerrit.wikimedia.org/r/980375 (owner: Jbond)
[13:34:08] <logmsgbot>	 !log reedy@deploy2002 reedy: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:34:13] <stashbot>	 T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255
[13:34:14] <stashbot>	 T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170
[13:34:14] <stashbot>	 T385269: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269
[13:34:14] <stashbot>	 T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - https://phabricator.wikimedia.org/T385267
[13:34:17] <logmsgbot>	 !log reedy@deploy2002 reedy: Continuing with sync
[13:35:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: DLynch)
[13:40:05] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800
[13:41:05] <wikibugs>	 (PS2) Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800
[13:43:52] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116783|Add missing array_values for PHP 7 compatibility (T385255)]], [[gerrit:1116784|SpecialMathWikibase: Null-coalescence getDescription() call (T385170)]], [[gerrit:1116785|SpecialMathWikibase: Null-coalescence $par (T385269)]], [[gerrit:1116788|ApiQueryContentTranslationSuggestions: Set default value for to and from parameters (T385267)]] (duration: 16m 43s)
[13:43:58] <stashbot>	 T385255: Error: Cannot unpack array with string keys - https://phabricator.wikimedia.org/T385255
[13:43:58] <stashbot>	 T385170: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385170
[13:43:58] <stashbot>	 T385269: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385269
[13:43:59] <stashbot>	 T385267: PHP Deprecated: str_replace(): Passing null to parameter #2 ($replace) of type array|string is deprecated - https://phabricator.wikimedia.org/T385267
[13:47:02] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: promethes: node_kernel_messages: fix another typo in source file name [puppet] - https://gerrit.wikimedia.org/r/1116802 (https://phabricator.wikimedia.org/T380960)
[13:47:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73088 and previous config saved to /var/cache/conftool/dbconfig/20250203-134742-root.json
[13:47:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] promethes: node_kernel_messages: fix another typo in source file name [puppet] - https://gerrit.wikimedia.org/r/1116802 (https://phabricator.wikimedia.org/T380960) (owner: Arturo Borrero Gonzalez)
[13:50:03] <wikibugs>	 (CR) Gergő Tisza: [C:+1] Enable $wgAllowAuthenticatedCrossOrigin on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[13:51:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10517106 (fgiunchedi) >>! In T384731#10511648, @cmooney wrote: >>>! In T384731#10511163, @fgiunchedi wrote: >> Ye...
[13:54:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:55:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10517133 (fgiunchedi) SGTM too, re: extracting hostname from interface description we could do it via regexp if the extraction/...
[13:57:22] <wikibugs>	 (CR) CDanis: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[13:59:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1400).
[14:00:04] <jouncebot>	 DreamRimmer and kemayo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:20] <Lucas_WMDE>	 o/
[14:00:20] <DreamRimmer>	  /o
[14:00:41] <Kemayo>	 o/
[14:01:24] <Lucas_WMDE>	 I can deploy today!
[14:02:49] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Change "$wgUploadMissingFileUrl" for svwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[14:03:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[14:03:05] <Lucas_WMDE>	 let’s start with DreamRimmer then
[14:04:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:05:16] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Change "$wgUploadMissingFileUrl" for svwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[14:05:25] <wikibugs>	 (Merged) jenkins-bot: Change "$wgUploadMissingFileUrl" for svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[14:05:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]]
[14:05:42] <stashbot>	 T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452
[14:11:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:11:24] <stashbot>	 T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452
[14:11:28] <Lucas_WMDE>	 DreamRimmer: please test :)
[14:11:34] <DreamRimmer>	 doing
[14:13:47] <DreamRimmer>	 looks good
[14:13:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync
[14:13:53] <Lucas_WMDE>	 \o/
[14:14:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:15:41] <wikibugs>	 ops-codfw, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10517235 (Jhancock.wm)
[14:17:18] <wikibugs>	 (CR) Bartosz Dziewoński: "Absolutely, feel free to deploy whenever you have time. I tested manually on the beta cluster and the httpbb tests should cover the rest. " [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[14:19:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188 T385084', diff saved to https://phabricator.wikimedia.org/P73091 and previous config saved to /var/cache/conftool/dbconfig/20250203-141939-marostegui.json
[14:19:42] <stashbot>	 T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084
[14:19:46] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1188.eqiad.wmnet
[14:20:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114663|Change "$wgUploadMissingFileUrl" for svwiktionary (T383452)]] (duration: 14m 42s)
[14:20:24] <stashbot>	 T383452: Edit "$wgUploadMissingFileUrl" for sv wiktionary - https://phabricator.wikimedia.org/T383452
[14:21:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: DLynch)
[14:22:12] <wikibugs>	 (Merged) jenkins-bot: Enable VisualEditor EditCheck on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1116543 (https://phabricator.wikimedia.org/T385205) (owner: DLynch)
[14:22:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]]
[14:22:30] <stashbot>	 T385205: [config] Enable Edit Check (References) for all newcomers at de.wiki - https://phabricator.wikimedia.org/T385205
[14:26:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 kemayo, lucaswerkmeister-wmde: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:26:11] <Lucas_WMDE>	 Kemayo: please test :)
[14:26:20] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1188.eqiad.wmnet
[14:26:25] <Kemayo>	 Lucas_WMDE: It looks good.
[14:26:31] <Lucas_WMDE>	 ok, thanks!
[14:26:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 kemayo, lucaswerkmeister-wmde: Continuing with sync
[14:26:49] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Index rebuild
[14:30:31] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1487MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[14:33:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116543|Enable VisualEditor EditCheck on dewiki (T385205)]] (duration: 10m 43s)
[14:33:14] <stashbot>	 T385205: [config] Enable Edit Check (References) for all newcomers at de.wiki - https://phabricator.wikimedia.org/T385205
[14:34:58] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove /pt from ptwikibooks $wgUploadMissingFileUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/1116812
[14:35:17] <Lucas_WMDE>	 since we have a bit of time left in the window – anyone want to +1 ^ so I can deploy it? :)
[14:35:26] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Change "$wgUploadMissingFileUrl" for svwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:47] <Lucas_WMDE>	 eh, let’s call the window done and see if anyone reviews that config change later
[14:42:53] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:22] <wikibugs>	 (CR) Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[14:57:19] <wikibugs>	 SRE, Traffic, Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10517441 (Marostegui)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:56] <wikibugs>	 (PS1) Andrew Bogott: Revert "Revert "Horizon: update release version for codfw1dev"" [puppet] - https://gerrit.wikimedia.org/r/1116816
[15:11:29] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "Revert "Horizon: update release version for codfw1dev"" [puppet] - https://gerrit.wikimedia.org/r/1116816 (owner: Andrew Bogott)
[15:12:01] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485 (RobH) NEW
[15:12:14] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10517536 (RobH)
[15:14:02] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10517540 (RobH) a:BTullis @BTullis   Please review the above task description and checklist to ensure it covers all the hostnames for upgrade to the 8TB HDD and t...
[15:20:24] <wikibugs>	 SRE, Wikimedia-Etherpad, SecTeam-Processed, Security: Deletion of etherpad - https://phabricator.wikimedia.org/T385356#10517566 (sbassett)
[15:20:29] <wikibugs>	 SRE, Wikimedia-Etherpad, SecTeam-Processed, Security: Deletion of etherpad - https://phabricator.wikimedia.org/T385356#10517567 (sbassett) p:Triage→Low
[15:21:22] <wikibugs>	 SRE, Wikimedia-Etherpad, SecTeam-Processed, Security: Delete etherpad "thispadisnotsecureJustTesting" - https://phabricator.wikimedia.org/T385356#10517572 (sbassett)
[15:23:22] <wikibugs>	 (PS1) Volans: spicerack: extend run_cookbook() accessor [software/spicerack] - https://gerrit.wikimedia.org/r/1116818
[15:30:31] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1472MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[15:32:28] <wikibugs>	 (PS1) CDanis: tunnelencabulator: add reqctl & bump [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269)
[15:33:07] <wikibugs>	 (CR) Volans: [C:+1] "Nice!" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269) (owner: CDanis)
[15:34:45] <wikibugs>	 (CR) CDanis: [V:+2 C:+2] tunnelencabulator: add reqctl & bump [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1116821 (https://phabricator.wikimedia.org/T382269) (owner: CDanis)
[15:34:56] <wikibugs>	 (PS2) Herron: add aux-k8s-codfw cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417)
[15:37:26] <wikibugs>	 (CR) CDanis: [C:+1] add aux-k8s-codfw cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: Herron)
[15:37:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db1169.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73093 and previous config saved to /var/cache/conftool/dbconfig/20250203-153755-fceratto.json
[15:37:59] <stashbot>	 T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141
[15:40:01] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: prometheus-node-kernel-messages.sh: don't fail if there are no matches [puppet] - https://gerrit.wikimedia.org/r/1116822 (https://phabricator.wikimedia.org/T380960)
[15:40:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: provisioning - T385141
[15:41:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1251.eqiad.wmnet with reason: provisioning - T385141
[15:43:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus-node-kernel-messages.sh: don't fail if there are no matches [puppet] - 10https://gerrit.wikimedia.org/r/1116822 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez)
[15:45:22] <wikibugs>	 (03PS2) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[15:45:48] <wikibugs>	 (03CR) 10Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm)
[15:46:37] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur)
[15:46:37] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur)
[15:48:07] <wikibugs>	 (03CR) 10Herron: [C:03+2] add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[15:49:38] <wikibugs>	 (03CR) 10Volans: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:52:13] <wikibugs>	 (03PS1) 10Sohom Datta: Fix regression with re-enabling button after error [extensions/PageTriage] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1116824 (https://phabricator.wikimedia.org/T385355)
[15:52:20] <wikibugs>	 (03Merged) 10jenkins-bot: add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[15:53:19] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1116825
[15:53:23] <wikibugs>	 (03PS1) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493)
[15:54:42] <wikibugs>	 (03CR) 10Volans: "Thanks for the patch, couple of considerations inline." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1115767 (owner: 10JMeybohm)
[15:54:48] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/PageTriage] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1116824 (https://phabricator.wikimedia.org/T385355) (owner: 10Sohom Datta)
[15:55:46] <wikibugs>	 (03PS1) 10Federico Ceratto: instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141)
[15:56:44] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] varnish: x-analytics: Authorization header summary [puppet] - 10https://gerrit.wikimedia.org/r/1111695 (owner: 10CDanis)
[15:56:46] <wikibugs>	 (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[15:58:41] <wikibugs>	 (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[15:59:22] <wikibugs>	 (03CR) 10Marostegui: [C:04-1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:04:07] <wikibugs>	 (03PS2) 10Federico Ceratto: instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141)
[16:04:20] <wikibugs>	 (03PS1) 10Scott French: php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225)
[16:04:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:05:10] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:12:58] <wikibugs>	 (03PS6) 10Jcrespo: dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902)
[16:12:58] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916)
[16:21:14] <wikibugs>	 06SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701#10517780 (10Nemoralis) >>! In T336701#9383499, @TheresNoTime wrote: > Atlassian statuspage has [[ https://support.atlassian.com/statuspage/docs/enable-webhook-notifications/ | webhook support ]].. that...
[16:25:37] <wikibugs>	 (03PS7) 10Jcrespo: dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902)
[16:27:00] <wikibugs>	 (03PS3) 10Herron: aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417)
[16:28:38] <wikibugs>	 (03CR) 10CDanis: [C:03+1] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[16:30:05] <jouncebot>	 jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1630)
[16:34:14] <wikibugs>	 (03PS3) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[16:34:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10517875 (10phaultfinder)
[16:34:45] <wikibugs>	 (03PS2) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493)
[16:34:45] <wikibugs>	 (03PS1) 10Elukey: kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530)
[16:36:21] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur)
[16:37:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db1251.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73096 and previous config saved to /var/cache/conftool/dbconfig/20250203-163722-fceratto.json
[16:37:25] <stashbot>	 T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141
[16:37:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73097 and previous config saved to /var/cache/conftool/dbconfig/20250203-163727-root.json
[16:37:41] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356)
[16:38:10] <wikibugs>	 (03CR) 10Jgiannelos: [C:04-1] kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:38:43] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:39:03] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:40:10] <wikibugs>	 (03CR) 10Elukey: kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:41:11] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:41:57] <wikibugs>	 (03CR) 10Jgiannelos: [C:04-1] kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:42:08] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French)
[16:42:40] <wikibugs>	 (03PS1) 10Daimona Eaytoy: core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822)
[16:43:10] <wikibugs>	 (03PS2) 10Elukey: kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530)
[16:43:10] <wikibugs>	 (03PS3) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493)
[16:43:28] <wikibugs>	 (03CR) 10Elukey: kartotherian: update Docker image and geoshapes yaml config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:43:50] <wikibugs>	 (03PS2) 10Daimona Eaytoy: core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822)
[16:44:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1169.eqiad.wmnet onto db1251.eqiad.wmnet
[16:44:36] <wikibugs>	 (03CR) 10Scott French: [C:03+1] deployment_server: Don't choke on 'Extension:scriptname' in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1116535 (https://phabricator.wikimedia.org/T380533) (owner: 10RLazarus)
[16:45:33] <swfrench-wmf>	 jouncebot: nowandnext
[16:45:33] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1630)
[16:45:33] <jouncebot>	 In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800)
[16:45:33] <jouncebot>	 In 1 hour(s) and 14 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800)
[16:46:03] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:46:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:46:37] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:46:53] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:46:56] <wikibugs>	 (03CR) 10Elukey: [C:03+2] kartotherian: update Docker image and geoshapes yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116833 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:52:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73099 and previous config saved to /var/cache/conftool/dbconfig/20250203-165232-root.json
[16:54:28] <wikibugs>	 (03PS4) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[16:54:38] <wikibugs>	 (03PS1) 10Effie Mouzeli: shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038)
[16:54:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10517991 (10Papaul) @VRiley-WMF it looks like all those servers are connected to 1G. Can you please move them to 10G ports and update the task i can h...
[16:56:17] <wikibugs>	 (03PS1) 10Effie Mouzeli: shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038)
[16:57:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: 10Scott French)
[16:57:19] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356)
[16:57:19] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356)
[16:57:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:58:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[17:01:46] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10518010 (aborrero) >>! In T382412#10512402, @cmooney wrote: > I'm guessing you're gonna migrate by removing on...
[17:01:58] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[17:03:35] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1002 [puppet] - https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356)
[17:03:35] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1001 [puppet] - https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356)
[17:07:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73100 and previous config saved to /var/cache/conftool/dbconfig/20250203-170737-root.json
[17:13:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73101 and previous config saved to /var/cache/conftool/dbconfig/20250203-171322-marostegui.json
[17:13:26] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:20:31] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Enable $wgAllowAuthenticatedCrossOrigin on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister)
[17:22:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73102 and previous config saved to /var/cache/conftool/dbconfig/20250203-172243-root.json
[17:23:48] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10518139 (Jhancock.wm) i got some instructions from dell. kind of similar to what we tried with some extra cables to reset. Is this server still depooled?
[17:24:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:24:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:24:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:24:45] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:24:45] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:24:56] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10518143 (MatthewVernon) Yeah, you can work on this server any time, but thanks for checking :)
[17:25:45] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:25:45] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:27:42] <wikibugs>	 (03PS1) 10CDanis: chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748)
[17:27:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:27:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:27:45] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:28:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P73103 and previous config saved to /var/cache/conftool/dbconfig/20250203-172829-marostegui.json
[17:32:15] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:32:38] <wikibugs>	 ops-codfw, SRE, Data-Persistence-Backup, database-backups, and 2 others: decommission db2139 - https://phabricator.wikimedia.org/T383971#10518217 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:32:47] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:36:58] <wikibugs>	 (03CR) 10Aude: [C:03+1] chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis)
[17:37:04] <wikibugs>	 (03CR) 10CDanis: [C:03+2] chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis)
[17:37:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73104 and previous config saved to /var/cache/conftool/dbconfig/20250203-173748-root.json
[17:38:20] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Fix dump grants for backup sources and m1 [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo)
[17:38:21] <wikibugs>	 (03Merged) 10jenkins-bot: chart-renderer: new new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116843 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis)
[17:39:16] <logmsgbot>	 !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[17:39:52] <logmsgbot>	 !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[17:40:51] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916)
[17:40:51] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update grants for misc hosts other than m1 [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902)
[17:40:53] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902)
[17:42:58] <wikibugs>	 (03CR) 10Jcrespo: "This is technically a production change, just happens to be part of the backup app." [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo)
[17:43:25] <wikibugs>	 (03CR) 10Jcrespo: "This is pending to be deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo)
[17:43:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P73105 and previous config saved to /var/cache/conftool/dbconfig/20250203-174336-marostegui.json
[17:44:09] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Update grants for x1 dump sections too [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) (owner: 10Jcrespo)
[17:44:29] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "Merging this as this was deployed at the same time than the previous backup source change." [puppet] - 10https://gerrit.wikimedia.org/r/1116831 (https://phabricator.wikimedia.org/T376916) (owner: 10Jcrespo)
[17:46:10] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[17:46:42] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[17:46:45] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[17:47:12] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[17:57:38] <wikibugs>	 (03PS5) 10Fabfur: hiera: enable json logging for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[17:58:31] <urbanecm>	 !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'JOestby' 'Johannesoestby' # T385503
[17:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:34] <stashbot>	 T385503: Unblock stuck global renames - https://phabricator.wikimedia.org/T385503
[17:58:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T384592)', diff saved to https://phabricator.wikimedia.org/P73106 and previous config saved to /var/cache/conftool/dbconfig/20250203-175843-marostegui.json
[17:58:46] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:58:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[17:59:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73107 and previous config saved to /var/cache/conftool/dbconfig/20250203-175904-marostegui.json
[17:59:18] <urbanecm>	 [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'Tarasssst' 'TR101' # T385503
[17:59:24] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur)
[18:00:04] <jouncebot>	 swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800).
[18:00:04] <jouncebot>	 ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T1800). Please do the needful.
[18:00:20] <swfrench-wmf>	 o/
[18:01:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115966 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:01:31] <urbanecm>	 !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=newiki --logwiki=metawiki 'Tarasssst' 'TR101' # T385503
[18:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:41] <urbanecm>	 that's why the log entry didn't make it to the task... missing !_log
[18:02:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enroll 10% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115966 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:02:31] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]]
[18:02:34] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:06:19] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:07:18] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[18:13:45] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115966|Enroll 10% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 13s)
[18:13:48] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:19:17] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-api-int: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115972 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:20:26] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115972 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:25:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[18:27:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[18:27:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:27:49] <wikibugs>	 (03CR) 10Herron: [C:03+2] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1116825 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[18:28:51] <swfrench-wmf>	 !log mw-api-int to ~ 1% of traffic on PHP 8.1 in eqiad - T383845
[18:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:54] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:29:39] <wikibugs>	 (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor)
[18:29:40] <wikibugs>	 (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor)
[18:32:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:37:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[18:39:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518605 (phaultfinder)
[18:39:41] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[18:41:29] <swfrench-wmf>	 !log mw-api-int to ~ 1% of traffic on PHP 8.1 in codfw - T383845
[18:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:31] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:42:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:45:19] <cdanis>	 ^ expected, being turned up by herron 
[18:45:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[18:46:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[18:47:55] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:48:10] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:48:50] <herron>	 I'll enter some silences
[18:50:30] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1169.eqiad.wmnet onto db1251.eqiad.wmnet
[18:50:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[18:51:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[18:52:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: etcd.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:55:55] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505)
[18:55:55] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:00:09] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002"
[19:00:13] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002"
[19:00:13] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:02:21] <icinga-wm>	 PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2003 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:06:23] <icinga-wm>	 PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2004 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:10:25] <icinga-wm>	 PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2005 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:11:30] <wikibugs>	 SRE, Infrastructure-Foundations, Kubernetes, SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10518711 (herron) >>! In T381417#10518550, @gerritbot wrote: > Change #1116825 **merged** by Herron: > %%%[operations/puppet@production] aux_k8s...
[19:17:37] <swfrench-wmf>	 jouncebot: nowandnext
[19:17:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 42 minute(s)
[19:17:37] <jouncebot>	 In 1 hour(s) and 42 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2100)
[19:19:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518739 (phaultfinder)
[19:19:57] <icinga-wm>	 RECOVERY - Host analytics1073 is UP: PING WARNING - Packet loss = 33%, RTA = 0.93 ms
[19:21:15] <swfrench-wmf>	 unless there are any objections, I'd like to use this quiet spot to deploy a fix for T385225, which will require a scap deployment in order to pick up a new base image
[19:21:16] <stashbot>	 T385225: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225
[19:26:21] <icinga-wm>	 PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100%
[19:26:45] <wikibugs>	 (PS1) Dwisehaupt: Another CNAME for acoustic landing pages [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931)
[19:27:19] <wikibugs>	 (CR) Dwisehaupt: "One last one needed before acoustic can shift the site over to the new cert." [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: Dwisehaupt)
[19:29:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518765 (phaultfinder)
[19:31:01] <wikibugs>	 (CR) Dwisehaupt: [C:-2] "Acoustic provided the wrong info. Updating with the correct info in a sec." [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: Dwisehaupt)
[19:31:38] <rzl>	 swfrench-wmf: no rush but whenever you're done with that I'll roll out an apache config change
[19:31:52] <rzl>	 (if there's time before the 21:00 backport window, and if not I can just do it later)
[19:32:20] <swfrench-wmf>	 rzl: ack, thanks! feel free to go ahead, actually - I appear to have stepped on a reprepro rake :)
[19:32:45] <rzl>	 sure, going! good luck with your rake
[19:32:58] <wikibugs>	 (CR) RLazarus: [V:+1 C:+2] Use new 'auth' docroot for the auth domain [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[19:37:56] <rzl>	 works on metal mwdebug, scapping
[19:38:36] <swfrench-wmf>	 rzl: I think I've got my end sorted, so you'll see me doing a bit of work in the background, but nothing that should actually change production until you're done :)
[19:38:57] <rzl>	 👍
[19:39:11] <rzl>	 thanks for the heads up! I'll let you know when I'm hands-off
[19:40:19] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:40:31] <swfrench-wmf>	 !log ran reprepro include mercurius 1.1.0-1 - T385225
[19:40:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:34] <stashbot>	 T385225: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225
[19:40:47] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:43:23] <MatmaRex>	 oh, thanks for deploying that :)
[19:45:59] <rzl>	 sure thing :) thanks for your patience
[19:46:50] <wikibugs>	 (CR) Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[19:46:58] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: T383952, T384137
[19:47:02] <stashbot>	 T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952
[19:47:02] <stashbot>	 T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137
[19:50:16] <rzl>	 MatmaRex: hmm, deployed to the testservers but the new httpbb tests for robots.txt and favicon.ico are failing
[19:50:23] <rzl>	 https://www.irccloud.com/pastebin/LI4ZYclY/
[19:50:37] <rzl>	 I'm taking a look but since you're around, in case you want to dig :)
[19:50:52] <MatmaRex>	 rzl: could be cached
[19:51:08] <rzl>	 on the appserver?
[19:51:30] <MatmaRex>	 oh
[19:52:27] <MatmaRex>	 yeah, that's not right
[19:53:01] <wikibugs>	 (PS2) Lucas Werkmeister: Enable $wgAllowAuthenticatedCrossOrigin on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944)
[19:53:01] <wikibugs>	 (PS1) Lucas Werkmeister: DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944)
[19:53:21] <MatmaRex>	 it looks like instead of serving the files directly, it's being rewritten to… static.php, probably?
[19:53:29] <wikibugs>	 (CR) Lucas Werkmeister: [C:-1] "do not merge yet, only uploading this because I already wanted to write the comment down" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[19:53:50] <wikibugs>	 (CR) CI reject: [V:-1] DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[19:54:36] <wikibugs>	 (PS2) Lucas Werkmeister: DNM: Enable $wgAllowAuthenticatedCrossOrigin on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1116860 (https://phabricator.wikimedia.org/T322944)
[19:54:49] <rzl>	 the tests are passing on mwdebug2001.codfw.wmnet, but not on mwdebug.discovery.wmnet:4444, so normally I would think we didn't bring the config change into the k8s side correctly
[19:54:59] <rzl>	 except that the diffs did look correct
[19:57:01] <rzl>	 e.g. (only including mw-web-main, the others all looked the same) https://www.irccloud.com/pastebin/mA7qkGKv/
[19:57:31] <rzl>	 cc swfrench-wmf in case you can spot something I missed
[19:57:40] <swfrench-wmf>	 looking
[19:58:14] <rzl>	 all the extra +ServerName -ServerNames do cancel out correctly, one of those cases where the minimal lines diff isn't the same as the minimal semantic one
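rzl's remark that the minimal line diff isn't the minimal semantic one can be made concrete. The following is a minimal Python sketch (a hypothetical helper, not part of scap, helmfile, or any Wikimedia tooling) that contrasts the number of `-`/`+` lines a line diff reports with a position-insensitive comparison of the two rendered configs. Treating a config as a multiset of lines is an assumption made for reading purposes only: in Apache, moving a directive into a different `<VirtualHost>` block does change behaviour.

```python
import difflib
from collections import Counter


def raw_and_net_changes(old, new):
    """Contrast a line diff with a position-insensitive comparison.

    Returns (raw_changed, net_removed, net_added):
      raw_changed: number of '-'/'+' lines that difflib.ndiff reports
      net_removed: lines that occur more often in `old` than in `new`
      net_added:   lines that occur more often in `new` than in `old`
    Position is ignored on purpose (reading aid only), so a line that is
    removed in one place and re-added in another cancels out.
    """
    old_lines, new_lines = old.splitlines(), new.splitlines()
    # ndiff prefixes each line with '- ', '+ ', '  ' or '? ' (hint lines).
    raw_changed = sum(
        1 for line in difflib.ndiff(old_lines, new_lines) if line[:2] in ("- ", "+ ")
    )
    # Counter subtraction keeps only positive counts, i.e. true net changes.
    net_removed = sorted((Counter(old_lines) - Counter(new_lines)).elements())
    net_added = sorted((Counter(new_lines) - Counter(old_lines)).elements())
    return raw_changed, net_removed, net_added


# Two ServerName lines swap places and one DocumentRoot actually changes.
old = "ServerName a\nServerName b\nDocumentRoot /x"
new = "ServerName b\nServerName a\nDocumentRoot /y"
raw, removed, added = raw_and_net_changes(old, new)
print(raw, removed, added)
```

In this toy input the line diff reports several changed lines, while the net comparison isolates the single DocumentRoot change; the ServerName reordering cancels out, much like the +ServerName/-ServerName pairs described above.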
[20:00:55] <rzl>	 also curious that it's a 500 for mwdebug but a 503 for mwdebug-next
[20:00:57] <MatmaRex>	 rzl: it seems to me that there should be a diff for the lines controlled by "public_rewrites: false", almost at the very bottom, but there isn't
[20:01:08] <rzl>	 (which I confirm from the browser)
[20:03:29] <MatmaRex>	 the behavior sure looks like "public_rewrites: false" is not having any effect. did i put it in the wrong place or something?
[20:03:30] <rzl>	 MatmaRex: hmm, true
[20:03:57] <rzl>	 it did have the correct effect on the bare-metal hosts but not in the k8s version, I wonder if we do the templating differently there
[20:04:22] <rzl>	 okay, I'm inclined to revert and try again another time -- any data you want to collect first?
[20:04:48] <MatmaRex>	 not really
[20:04:51] <rzl>	 (this isn't causing any harm and we can keep looking, I just want to relinquish the conch and unblock swfrench-wmf's other thing)
[20:05:34] <rzl>	 ^ also neat, I exited scap and I guess that killed logmsgbot 
[20:06:25] <rzl>	 meanwhile, good job writing the tests :)
[20:06:34] <wikibugs>	 (PS1) RLazarus: Revert "Use new 'auth' docroot for the auth domain" [puppet] - https://gerrit.wikimedia.org/r/1116862
[20:06:56] <swfrench-wmf>	 if you all need more time for debugging, that's totally fine - my change shouldn't take _too_ long to get out (aside from surprises that result in slow image builds)
[20:07:37] <rzl>	 nah I think we have what we need -- we might be able to make another try after you're done, depending
[20:07:47] <MatmaRex>	 rzl: heh, yeah, thanks for making me add them ;)
[20:07:51] <rzl>	 just awaiting slowkins
[20:08:06] <MatmaRex>	 do you have any idea why this would work differently under kubernetes? (i don't)
[20:08:44] <rzl>	 specifically no, but we reproduced a lot of the templating logic and I wouldn't be stunned if they inadvertently handle that variable differently -- I can start digging once this is rolled back
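rzl's hypothesis (the Kubernetes side reproduces the puppet templating logic and may handle `public_rewrites` differently) points at a cheap safeguard: render the same site definitions through both code paths and flag any site whose output differs. Below is a toy Python sketch of such a parity check. Both renderers, the `Include public-rewrites.conf` placeholder, and the site dictionaries are hypothetical; they are not the real puppet or helm templates, and the k8s renderer is written to ignore the flag on purpose.

```python
# Hypothetical placeholder for whatever block "public_rewrites" controls.
PUBLIC_REWRITES = "Include public-rewrites.conf"


def render_bare_metal(site):
    """Toy renderer that honours the public_rewrites flag (default: enabled)."""
    lines = [f"ServerName {site['server_name']}"]
    if site.get("public_rewrites", True):
        lines.append(PUBLIC_REWRITES)
    return lines


def render_k8s(site):
    """Toy re-implementation that never reads public_rewrites."""
    return [f"ServerName {site['server_name']}", PUBLIC_REWRITES]


def parity_failures(sites):
    """Return server names whose two renderings differ."""
    return [
        site["server_name"]
        for site in sites
        if render_bare_metal(site) != render_k8s(site)
    ]


sites = [
    {"server_name": "en.wikipedia.org"},
    {"server_name": "auth.wikimedia.org", "public_rewrites": False},
]
print(parity_failures(sites))  # ['auth.wikimedia.org']
```

In this toy setup only the site that sets `public_rewrites: False` is reported, which mirrors the missing diff MatmaRex noticed near the bottom of the k8s output.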
[20:08:47] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Use new 'auth' docroot for the auth domain" [puppet] - https://gerrit.wikimedia.org/r/1116862 (owner: RLazarus)
[20:08:48] <jinxer-wm>	 FIRING: [2x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:09:26] <rzl>	 ughhhhh yes okay I didn't word-wrap my own revert message in the gerrit box
[20:09:33] <rzl>	 you're so right to save production from my recklessness
[20:09:45] <swfrench-wmf>	 we can't have that!
[20:10:20] <wikibugs>	 (PS2) RLazarus: Revert "Use new 'auth' docroot for the auth domain" [puppet] - https://gerrit.wikimedia.org/r/1116862
[20:10:46] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor)
[20:12:40] <wikibugs>	 (CR) RLazarus: [C:+2] Revert "Use new 'auth' docroot for the auth domain" [puppet] - https://gerrit.wikimedia.org/r/1116862 (owner: RLazarus)
[20:13:48] <jinxer-wm>	 FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:21:52] <rzl>	 rolled back successfully on mwdebug, running puppet on deploy2002 now and then scapping
[20:23:46] <MatmaRex>	 rzl: btw, i wonder, is there a kubernetes-based environment on the beta cluster? i am wondering if i could have caught this problem before the production deployment
[20:25:29] <rzl>	 I don't know :)
[20:26:47] <wikibugs>	 (CR) Scott French: "Thank you both for the reviews!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: Scott French)
[20:27:04] <wikibugs>	 (CR) Scott French: [V:+2] "Verified to build locally." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: Scott French)
[20:27:30] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] php8.1: rebuild to pick up new mercurius [docker-images/production-images] - https://gerrit.wikimedia.org/r/1116827 (https://phabricator.wikimedia.org/T385225) (owner: Scott French)
[20:27:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:29:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10518948 (phaultfinder)
[20:29:46] <rzl>	 just for completeness, since I excerpted the original diffs, here's the full output from the rollback, since there's only a diff against mw-debug and friends -- just note the +s and -s are reversed since it's a revert https://www.irccloud.com/pastebin/jiq3q4CW/
[20:29:56] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: T383952, T384137
[20:30:00] <stashbot>	 T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952
[20:30:00] <stashbot>	 T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137
[20:30:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:30:32] <rzl>	 ^ expected from httpbb version skew with this puppet change, will self-resolve
[20:31:53] <logmsgbot>	 !log rzl@deploy2002 rzl: T383952, T384137 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:32:05] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:32:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:32:35] <logmsgbot>	 !log rzl@deploy2002 rzl: Continuing with sync
[20:32:49] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:11] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:35] <logmsgbot>	 !log rzl@deploy2002 Finished scap sync-world: T383952, T384137 (duration: 06m 10s)
[20:34:01] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:34:02] <rzl>	 swfrench-wmf: all yours
[20:34:12] <swfrench-wmf>	 rzl: ack, thank you!
[20:34:32] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:34:43] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:34:43] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:34:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:35:39] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:35:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:07] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:39:08] <wikibugs>	 (PS1) Herron: wmnet: add codfw aux-k8s-etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417)
[20:48:58] <wikibugs>	 SRE, Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701#10519011 (Nemoralis) It looks like there is https://fox.nexus/@wikistatus run by @TheresNoTime
[20:51:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: Urbanecm)
[20:54:50] <wikibugs>	 (PS1) Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370)
[20:55:17] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[20:59:45] <wikibugs>	 (PS2) Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370)
[20:59:53] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2100)
[21:00:05] <jouncebot>	 Sohom_Datta and urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:12] <urbanecm>	 i can deploy today
[21:00:32] <urbanecm>	 swfrench-wmf: rzl: i noticed you did something mw related recently, all done?
[21:01:25] <swfrench-wmf>	 urbanecm: so, work on my end is a bit stuck at the moment, but at the specific point it's stuck, you should be able to proceed with your backport
[21:01:38] <urbanecm>	 ack, ty
[21:01:51] <swfrench-wmf>	 if you could check in with me before you proceed to the 2nd one, that would be greatly appreciated
[21:01:59] <urbanecm>	 sure
[21:02:04] <urbanecm>	 i don't see Sohom_Datta in here
[21:02:07] <urbanecm>	 so i'll just do the config now
[21:02:34] <swfrench-wmf>	 sounds good
[21:02:34] <wikibugs>	 (CR) Urbanecm: [C:+2] [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: Urbanecm)
[21:03:19] <wikibugs>	 (Merged) jenkins-bot: [Growth] enwiki: Enable mentorship for 75% of new accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/1116853 (https://phabricator.wikimedia.org/T384505) (owner: Urbanecm)
[21:04:21] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] "Thanks! Applies cleanly locally on git master." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1104740 (owner: Pppery)
[21:04:34] <wikibugs>	 (PS3) Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370)
[21:04:37] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[21:04:49] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]]
[21:04:52] <stashbot>	 T384505: Increase the number of new accounts getting a mentor at English Wikipedia - https://phabricator.wikimedia.org/T384505
[21:08:33] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:40] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[21:09:38] <wikibugs>	 (PS4) Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370)
[21:10:06] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[21:14:08] <wikibugs>	 (PS5) Andrew Bogott: haproxy/keystone: change balance algorithm to 'source' for public keystone [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370)
[21:14:16] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[21:15:11] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116853|[Growth] enwiki: Enable mentorship for 75% of new accounts (T384505)]] (duration: 10m 22s)
[21:15:14] <stashbot>	 T384505: Increase the number of new accounts getting a mentor at English Wikipedia - https://phabricator.wikimedia.org/T384505
[21:15:41] <urbanecm>	 still no sign of Sohom_Datta
[21:15:53] <urbanecm>	 swfrench-wmf: done
[21:16:04] <swfrench-wmf>	 urbanecm: ack, thanks!
[21:24:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10519118 (phaultfinder)
[21:27:05] <wikibugs>	 (CR) Andrew Bogott: [C:+2] "self-merging because I have users who are struggling with this right now." [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
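The haproxy/keystone change switches the balance algorithm to `source`. Per the haproxy documentation, `balance source` hashes the client's source IP address and divides it by the total weight of the running servers, so a given client keeps reaching the same server as long as no server goes down or comes up. The minimal Python sketch below illustrates only that stickiness property; the SHA-256 hash, the equal weights, and the backend names are assumptions, not haproxy's actual implementation.

```python
import hashlib
import ipaddress


def pick_backend(client_ip, backends):
    """Map a client IP onto one backend by hashing the address.

    Illustrates source-hash stickiness only: haproxy uses its own hash
    function and server weights, which this sketch does not reproduce.
    """
    packed = ipaddress.ip_address(client_ip).packed  # 4 bytes (IPv4) or 16 (IPv6)
    digest = hashlib.sha256(packed).digest()
    # Equal weights assumed: plain modulo over the number of backends.
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


backends = ["keystone-a", "keystone-b", "keystone-c"]  # hypothetical names
print(pick_backend("198.51.100.7", backends))
```

Because the mapping depends on `len(backends)`, adding or removing a server remaps many clients, which matches the caveat haproxy documents for this algorithm.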
[21:30:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:30:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10519135 (phaultfinder)
[21:32:08] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:32:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:34:43] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:34:43] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:34:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:37:07] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:48:44] <wikibugs>	 (PS2) Dwisehaupt: Shift CNAME for acoustic landing pages [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931)
[21:49:14] <wikibugs>	 (CR) Dwisehaupt: Shift CNAME for acoustic landing pages [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: Dwisehaupt)
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250203T2200).
[22:09:57] <wikibugs>	 (CR) BCornwall: [C:+1] Shift CNAME for acoustic landing pages [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: Dwisehaupt)
[22:11:04] <wikibugs>	 (CR) BCornwall: [C:+1] wmnet: add codfw aux-k8s-etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) (owner: Herron)
[22:13:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10519204 (VRiley-WMF)
[22:15:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10519206 (VRiley-WMF) a:Jgreen Hey Jeff  I was able to connect this unit up with 10G. It should now be ready to go for you. Let me know if you need anyt...
[22:39:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[22:43:57] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1001 - vriley@cumin1002"
[22:44:02] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1001 - vriley@cumin1002"
[22:44:02] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:52:26] <wikibugs>	 (PS4) BCornwall: conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074
[22:52:48] <wikibugs>	 (CR) CI reject: [V:-1] conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[22:54:22] <wikibugs>	 (PS5) BCornwall: conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074
[22:55:52] <wikibugs>	 (CR) CI reject: [V:-1] conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[22:55:53] <wikibugs>	 (Abandoned) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041 (owner: Jdlrobson)
[23:01:24] <wikibugs>	 (PS6) BCornwall: conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074
[23:01:50] <wikibugs>	 (CR) CI reject: [V:-1] conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[23:01:53] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[23:02:21] <wikibugs>	 (PS7) BCornwall: conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074
[23:03:51] <wikibugs>	 (CR) CI reject: [V:-1] conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[23:08:09] <wikibugs>	 (CR) Dwisehaupt: [C:+2] "Verbal confirmation from jgreen also." [dns] - https://gerrit.wikimedia.org/r/1116857 (https://phabricator.wikimedia.org/T384931) (owner: Dwisehaupt)
[23:08:19] <logmsgbot>	 !log dwisehaupt@dns1004 START - running authdns-update
[23:09:53] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1002 - vriley@cumin1002"
[23:09:57] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1002 - vriley@cumin1002"
[23:09:57] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:10:14] <logmsgbot>	 !log dwisehaupt@dns1004 END - running authdns-update
[23:14:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73108 and previous config saved to /var/cache/conftool/dbconfig/20250203-231428-marostegui.json
[23:14:31] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:29:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P73109 and previous config saved to /var/cache/conftool/dbconfig/20250203-232933-marostegui.json
[23:32:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10519367 (VRiley-WMF) a:Jgreen
[23:33:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10519369 (VRiley-WMF) All these should be set and ready to go! @Jgreen Let me know if you need anything else!
[23:44:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P73110 and previous config saved to /var/cache/conftool/dbconfig/20250203-234440-marostegui.json
[23:59:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T384592)', diff saved to https://phabricator.wikimedia.org/P73111 and previous config saved to /var/cache/conftool/dbconfig/20250203-235947-marostegui.json
[23:59:51] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592