[00:06:26] RESOLVED: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:03] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [00:37:28] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [00:38:14] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [00:38:39] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [01:00:50] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:25] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 35s) [01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:31:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:31:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:41:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T0600) [06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [06:35:58] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11188113 (Joe) Open→Resolved I will tentatively close this task for now. [06:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:25] FIRING: [2x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:04] maps2012 is expected, I'll down it and 2013/2014 [06:46:25] FIRING: [3x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:19] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps[2012-2014].codfw.wmnet with reason: in setup [06:50:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [06:54:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [06:58:00] SRE, Hiddenparma, Traffic: Better mapping of requests coming from datacenters/clouds - 
https://phabricator.wikimedia.org/T400120#11188127 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF [07:00:04] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T0700). [07:00:04] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] o/ [07:00:24] I can self-deploy [07:01:17] sergi0: how much time do you estimate? [07:01:41] 3 min or less, it's a labs only change [07:02:02] is it ok? [07:02:29] yes thanks! [07:02:57] !log upgrading Envoy on debmonitor T403663 [07:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:02] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [07:05:56] @effie all done [07:06:03] thank you [07:07:18] thanks, syncing my patch now [07:09:27] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] [07:09:33] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:11:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:11:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:13:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm [07:15:25] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:15:29] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:16:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.756 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:16:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.927 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:24:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2012.codfw.wmnet with OS bookworm [07:25:11] !log kharlan@deploy1003 kharlan: Continuing with sync [07:30:35] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] (duration: 21m 08s) [07:30:40] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:32:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:34:10] FIRING: BFDdown: BFD session down between cr2-eqiad and 208.80.154.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:36:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:36:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:37:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:39:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and 208.80.154.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:41:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:39] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188195 (elukey) For cp2050 I keep getting this: ` GET https://10.193.3.234/redfish/v1/TaskService/TaskMonitors/JID_580944559377 returned HTTP 400 Response... 
[07:43:37] done syncing [07:43:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [07:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:49:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [07:49:32] elukey@cumin1003 provision (PID 2932716) is awaiting input [07:51:36] jouncebot: next [07:51:36] In 2 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [07:52:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:53:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:56:52] doing another backport [08:03:11] this is me, ignore v [08:05:30] PROBLEM - MariaDB read only db_inventory #page on db2185 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.11.13-MariaDB-log, Uptime 570363s, event_scheduler: True, 100.34 QPS, connection latency: 0.030052s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:05:40] This is a test, ignore ^ [08:06:05] <_joe_> jynus: test successful? 
:) [08:06:23] yes [08:06:30] RECOVERY - MariaDB read only db_inventory #page on db2185 is OK: Version 10.11.13-MariaDB-log, Uptime 570423s, read_only: True, event_scheduler: True, 98.51 QPS, connection latency: 0.029999s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2012.codfw.wmnet with OS bookworm [08:10:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2013.codfw.wmnet with OS bookworm [08:13:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:15:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:22:46] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] [08:22:52] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:27:58] !log upgrading Envoy on IDM hosts T403663 [08:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:02] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [08:28:42] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [08:28:46] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:28:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [08:30:32] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:30:49] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:30:54] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:31:07] !log kharlan@deploy1003 kharlan: Continuing with sync [08:33:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [08:33:30] !log restart pybal on lvs1019/lvs2013/lvs2014 to clear out alert [08:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:28] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] (duration: 13m 41s) [08:36:32] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:37:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:38:54] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:40:16] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:42:18] !log upgrading Envoy on deployment hosts T403663 [08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:23] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [08:43:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:44:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:50:50] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188395 (elukey) cp2051 worked, cp2052 showed the issue, cp2053 worked. 
[08:50:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:52:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2013.codfw.wmnet with OS bookworm [08:53:29] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:54:14] (PS2) Muehlenhoff: Remove obsolete ganeti_init.sh script [puppet] - https://gerrit.wikimedia.org/r/1189117 [08:56:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:57:52] (CR) Stevemunene: [C:+1] deployment_server: restore service private files ownership [puppet] - https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) (owner: Brouberol) [08:59:40] jouncebot: nowandnext [08:59:40] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [08:59:41] In 1 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [09:00:45] (CR) Ladsgroup: [C:+1] instances.yaml: add es2049 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:01:20] (CR) Federico Ceratto: [C:+2] instances.yaml: add es2049 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:02:54] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis 
set policy GRACEFUL_RESTART [09:05:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:06:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:06:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:06:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1206 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83384 and previous config saved to /var/cache/conftool/dbconfig/20250917-091137-ladsgroup.json [09:11:43] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [09:14:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:15:09] (PS1) Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:16:26] (CR) Brouberol: [V:+1 C:+2] 
deployment_server: restore service private files ownership [puppet] - https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) (owner: Brouberol) [09:17:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2014.codfw.wmnet with OS bookworm [09:17:10] (PS2) Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:17:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:18:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:19:19] (CR) Elukey: [C:+1] Re-add maps2011 [puppet] - https://gerrit.wikimedia.org/r/1189110 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [09:19:25] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:19:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:19:53] (PS1) Cathal Mooney: Revert "cephosd: un-set bird bgp neighbors rather than override for each host" [puppet] - https://gerrit.wikimedia.org/r/1189119 [09:20:21] (CR) CI reject: [V:-1] Revert "cephosd: un-set bird bgp neighbors rather than override for each host" [puppet] - https://gerrit.wikimedia.org/r/1189119 (owner: Cathal Mooney) [09:20:33] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [09:21:56] RECOVERY - Kafka broker TLS certificate validity on kafka-main2006 is OK: SSL OK - Certificate kafka-main2006.codfw.wmnet valid until 2026-08-23 08:25:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:23:00] !log elukey@cumin1003 END (FAIL) - 
Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:24:15] (PS1) Hnowlan: (api|rest)-gateway: move Via header definition to response [deployment-charts] - https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) [09:24:19] (PS2) Cathal Mooney: Cephosd: revert to manually setting up peering IPs [puppet] - https://gerrit.wikimedia.org/r/1189119 [09:24:23] (CR) CI reject: [V:-1] (api|rest)-gateway: move Via header definition to response [deployment-charts] - https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) (owner: Hnowlan) [09:24:34] (CR) CI reject: [V:-1] WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) (owner: Elukey) [09:25:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:26:12] (PS2) Hnowlan: (api|rest)-gateway: move Via header definition to response [deployment-charts] - https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) [09:26:46] (CR) Muehlenhoff: [C:+2] imposm-initial-import: Fix check whether imposm is running [puppet] - https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [09:26:53] (PS2) Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) [09:27:44] (CR) Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: Marco Fossati) [09:27:56] RECOVERY - Kafka broker TLS certificate validity on kafka-main2008 is OK: SSL OK - Certificate kafka-main2008.codfw.wmnet valid until 2026-08-23 08:30:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:28:44] !log mass deleting watchlist of bots with > 50K watchlist rows (T404808) [09:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:48] T404808: Clean up large bots watchlists in all wikis - https://phabricator.wikimedia.org/T404808 [09:29:38] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:33:34] (PS3) Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:33:34] RECOVERY - Kafka broker TLS certificate validity on kafka-main2009 is OK: SSL OK - Certificate kafka-main2009.codfw.wmnet valid until 2026-08-23 08:27:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:33:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:35:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add es2049', diff saved to https://phabricator.wikimedia.org/P83385 and previous config saved to /var/cache/conftool/dbconfig/20250917-093550-fceratto.json [09:36:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:36:13] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:37:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1251 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83386 and previous config saved to /var/cache/conftool/dbconfig/20250917-093718-ladsgroup.json [09:37:22] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [09:38:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2049 slowly with 10 steps - Pooling in new host [09:39:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:40:09] RECOVERY - Kafka broker TLS certificate validity on kafka-main2010 is OK: SSL OK - Certificate kafka-main2010.codfw.wmnet valid until 2026-08-23 08:37:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:40:18] (CR) CI reject: [V:-1] WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) (owner: Elukey) [09:41:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1184 (s1 candidate master) from api group of s1 (T403966)', diff saved to 
https://phabricator.wikimedia.org/P83388 and previous config saved to /var/cache/conftool/dbconfig/20250917-094124-ladsgroup.json [09:41:43] RECOVERY - Kafka broker TLS certificate validity on kafka-main2007 is OK: SSL OK - Certificate kafka-main2007.codfw.wmnet valid until 2026-08-23 08:27:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:42:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [09:42:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:43:11] 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826 (10Joe) 03NEW [09:43:53] jouncebot: nowandnext [09:43:53] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [09:43:53] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [09:50:33] (03PS4) 10Elukey: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:52:22] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2050 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:52:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:53:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:54:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:57:23] elukey@cumin1003 provision (PID 2947963) is awaiting input [09:57:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:57:32] (03PS3) 10Arnaudb: Revert^4 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188798 (https://phabricator.wikimedia.org/T353891) [09:59:05] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2014.codfw.wmnet with OS bookworm [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [10:01:46] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11188609 (10ABran-WMF) https://gerrit.wikimedia.org/r/1188798 creates a 1GB local disk cache that should help with thos... [10:01:52] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:02:53] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:03:23] hi everybody, going to deploy mobileapps as part of the MW infra window (upgrading the statsd sidecar only) [10:04:37] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [10:05:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:05:18] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync [10:06:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [10:07:35] (03CR) 10Muehlenhoff: [C:03+2] Re-add maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189110 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:09:26] jouncebot: nowandnext [10:09:26] For the next 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [10:09:26] In 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [10:10:03] elukey: Can I deploy a operations/mediawiki-config change while you are doing that or would you prefer not? [10:11:37] (03PS1) 10Dreamy Jazz: Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) [10:12:00] Dreamy_Jazz: o/ already done, the only thing that worries me a little is https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=panel-13&from=now-6h&to=now&timezone=utc [10:12:29] Yeah, that isn't ideal [10:12:35] it is trending down, let's see if it settles [10:12:36] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2050 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:14:33] (03CR) 10Mszwarc: [C:03+1] Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) (owner: 10Dreamy Jazz) [10:16:15] !log installing openjpeg2 security updates [10:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:18:02] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:18:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) 
(owner: 10Dreamy Jazz) [10:19:46] (03Merged) 10jenkins-bot: Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) (owner: 10Dreamy Jazz) [10:20:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:20:12] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] [10:20:17] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:22:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1235 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83391 and previous config saved to /var/cache/conftool/dbconfig/20250917-102225-ladsgroup.json [10:22:31] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:24:42] Dreamy_Jazz: green light [10:25:28] Thanks. I was seeing a trend down to 0, so started already [10:26:08] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:26:13] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:27:58] !log dreamyjazz@deploy1003 Sync cancelled. 
[10:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:08] (03PS1) 10Dreamy Jazz: Set virtual domain mapping for virtual-checkuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) [10:30:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) (owner: 10Dreamy Jazz) [10:30:42] (03CR) 10Hnowlan: [C:03+1] switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - 10https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [10:31:31] (03Merged) 10jenkins-bot: Set virtual domain mapping for virtual-checkuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) (owner: 10Dreamy Jazz) [10:31:58] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] [10:32:02] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:33:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1169 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83393 and previous config saved to /var/cache/conftool/dbconfig/20250917-103306-ladsgroup.json [10:33:11] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:37:39] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:37:43] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:40:42] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11188841 (elukey) Updated the BMC and the firmware, that seems still in progress. 
Will check later on :) [10:42:37] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [10:46:41] !log trigger full OSM import on maps2011 T381565 [10:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:45] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [10:47:02] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186461 (owner: 10PipelineBot) [10:47:56] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] (duration: 15m 58s) [10:48:00] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:48:27] (03PS5) 10Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [10:48:30] (03CR) 10Brouberol: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [10:48:42] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186461 (owner: 10PipelineBot) [10:49:32] 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11188866 (10SLyngshede-WMF) Personally I don't love the private repository with Puppet code inside it, as it hides a lot of information. I get that this is the idea, but it mak... 
[10:50:31] (03CR) 10CI reject: [V:04-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [10:51:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s1 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83395 and previous config saved to /var/cache/conftool/dbconfig/20250917-105102-ladsgroup.json [10:51:07] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:54:46] (03PS6) 10Brouberol: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [10:55:54] (03PS3) 10Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) [10:57:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s3 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83397 and previous config saved to /var/cache/conftool/dbconfig/20250917-105709-ladsgroup.json [10:57:15] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:58:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:59:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1196 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83398 and previous config saved to 
/var/cache/conftool/dbconfig/20250917-105946-ladsgroup.json [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100). nyaa~ [11:00:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:00:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:00:36] 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11188888 (10MoritzMuehlenhoff) It's worth mentioning that starting next quarter we'll start work on moving the user data currently defined in data.yaml to a private repository,... [11:02:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:02:37] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [11:03:01] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:06:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188900 (10elukey) Done up to cp2058, all good (excluding cp2056 as requested). Next steps: - Upgrade firmwares - Check why the cookbook didn't run on cp2052... 
[11:08:05] elukey@cumin1003 reimage (PID 2954571) is awaiting input [11:08:32] (PS1) Jelto: ceph: add module to sync a bucket locally [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) [11:09:17] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11188903 (elukey) Tried provision and then reimage, this time I clearly noticed a PXE/HTTP boot request but it ended up in the OS booting (it was quick and... [11:09:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:10:16] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:12:39] (PS1) Federico Ceratto: es2050.yaml, site.pp: Prepare es2050 [puppet] - https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) [11:13:31] (PS1) Hnowlan: trafficserver: multi-dc: use client request host rather than remap [puppet] - https://gerrit.wikimedia.org/r/1189132 (https://phabricator.wikimedia.org/T401396) [11:15:26] ops-eqiad, SRE, DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11188939 (Jclark-ctr) Updated the Cable IDs for cables run on ssw1-d1-eqiad in NetBox. @VRiley-WMF, please update the ssw1-d8-eqiad Cable IDs when you have a chance. 
[11:15:58] (03CR) 10Ladsgroup: [C:03+1] es2050.yaml, site.pp: Prepare es2050 [puppet] - 10https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:16:42] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [11:18:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1167 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83400 and previous config saved to /var/cache/conftool/dbconfig/20250917-111858-ladsgroup.json [11:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:19:04] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [11:20:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db2152 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83401 and previous config saved to /var/cache/conftool/dbconfig/20250917-112010-ladsgroup.json [11:20:51] (03PS1) 10Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) [11:26:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:26:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:13] (03PS7) 10Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [11:32:45] jouncebot: nowandnext [11:32:45] For the next 0 hour(s) and 
27 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [11:32:45] In 0 hour(s) and 27 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200) [11:33:26] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11189010 (10ABran-WMF) cache is around 100MB and the UI is slowing down again [11:33:55] (03PS1) 10Dreamy Jazz: SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) [11:34:05] (03CR) 10Dreamy Jazz: [C:03+2] SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: 10Dreamy Jazz) [11:34:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: 10Dreamy Jazz) [11:36:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:02] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:39:29] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:45:33] (03Merged) 10jenkins-bot: SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: 10Dreamy Jazz) [11:46:00] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] [11:46:06] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:51:39] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:51:44] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:52:23] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [11:52:26] (03PS3) 10Klausman: team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) [11:54:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2049 slowly with 10 steps - Pooling in new host [11:55:03] (03CR) 10Klausman: [V:03+2 C:03+2] team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) (owner: 10Klausman) [11:56:40] (03Merged) 10jenkins-bot: team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) (owner: 10Klausman) [11:57:30] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] (duration: 11m 29s) [11:57:34] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:57:39] (03CR) 10Muehlenhoff: [C:03+2] Assign failoid role to failoid2003 [puppet] - 10https://gerrit.wikimedia.org/r/1183100 (https://phabricator.wikimedia.org/T402406) (owner: 10Muehlenhoff) [11:58:12] (03PS1) 10Dreamy Jazz: Use correct DB domain in SuggestedInvestigationsCaseLookupService [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) [11:58:15] jouncebot: nowandnext [11:58:15] For the next 0 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [11:58:15] In 0 hour(s) and 1 minute(s): 
Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200) [11:58:46] daimona: Do you mind if I am deploying during your window? [11:59:03] (03PS1) 10Slyngshede: P:idp remove NDA group access from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) [11:59:11] No problem, I'm probably just going to take a couple minutes to create the table [11:59:16] Thanks [11:59:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) (owner: 10Dreamy Jazz) [12:00:05] Daimona: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Create new table for the CampaignEvents extension deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200). [12:01:50] (03PS3) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [12:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:03:52] !log Creating new tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T400719 [12:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:57] T400719: Create database structure to store edit-to-event associations - https://phabricator.wikimedia.org/T400719 [12:04:18] (03PS1) 10Muehlenhoff: Move my non-FIDO SSH key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1189145 [12:05:02] 
(CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede) [12:05:08] (CR) Federico Ceratto: [C:+2] es2050.yaml, site.pp: Prepare es2050 [puppet] - https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [12:06:07] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede) [12:09:58] Done with my DB stuff. [12:10:54] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1188875 (owner: PipelineBot) [12:11:11] (Merged) jenkins-bot: Use correct DB domain in SuggestedInvestigationsCaseLookupService [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) (owner: Dreamy Jazz) [12:11:35] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] [12:11:40] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:12:38] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1188875 (owner: PipelineBot) [12:12:50] (PS2) Muehlenhoff: Failover failoid in codfw [puppet] - https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) [12:14:56] (PS1) Federico Ceratto: instances.yaml: add es2050 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1189148 (https://phabricator.wikimedia.org/T402859) [12:15:55] (CR) Brouberol: [C:+1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 
(https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:17:22] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:17:28] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:18:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:20:02] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:25:13] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] (duration: 13m 37s) [12:25:18] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:27:29] (03PS1) 10Muehlenhoff: Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 [12:28:27] (03PS2) 10Muehlenhoff: Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 [12:30:17] FIRING: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:32] (03CR) 10Slyngshede: Permissions: Prevent duplicate permission requests (031 comment) 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:30:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for es2050.codfw.wmnet [12:30:43] (03CR) 10Slyngshede: [C:03+2] Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:31:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11189225 (10MoritzMuehlenhoff) [12:33:13] (03Merged) 10jenkins-bot: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:33:50] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1189145 (owner: 10Muehlenhoff) [12:34:01] fceratto@cumin1002 upgrade (PID 2721453) is awaiting input [12:35:13] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:35:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:31] (03CR) 10A smart kitten: "@phuedx@wikimedia.org are we okay to create a revert of this revert, to hopefully be deployed at some point in the future?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188281 (owner: 10Phuedx) [12:38:04] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:38:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11189253 (10MoritzMuehlenhoff) [12:38:38] (03CR) 10Muehlenhoff: [C:03+2] Move my non-FIDO SSH key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1189145 (owner: 10Muehlenhoff) [12:38:45] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:41:02] (03CR) 10Brouberol: Add a dummy Ceph user keys for the cephcsi plugin to use (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:45:54] fceratto@cumin1002 upgrade (PID 2721453) is awaiting input [12:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:49:13] (03CR) 10David Caro: [C:03+1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [12:49:31] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:50:02] (03CR) 10David Caro: [C:03+1] ceph: Drop buster support (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [12:51:12] (03CR) 10Filippo Giunchedi: [C:03+1] Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 (owner: 10Muehlenhoff) [12:51:30] (03CR) 10Cathal Mooney: [C:03+2] Cephosd: revert to manually setting up peering IPs [puppet] - 10https://gerrit.wikimedia.org/r/1189119 (owner: 10Cathal Mooney) [12:52:35] PROBLEM - statsv Varnishkafka log producer on cp7007 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:53:35] RECOVERY - statsv Varnishkafka log producer on cp7007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:54:08] (03PS1) 10CDanis: admin: cdanis ssh: clean up old keys + add new fido2 key [puppet] - 10https://gerrit.wikimedia.org/r/1189153 [12:54:39] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:59:29] (03CR) 10Arnaudb: [C:03+1] "I've added a question inline, otherwise: looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:15] o/
[13:00:23] yup, I don’t see anything in the calendar either
[13:00:46] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1178858 (owner: PipelineBot)
[13:00:49] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1179669 (owner: PipelineBot)
[13:00:52] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1184706 (owner: PipelineBot)
[13:00:55] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1185992 (owner: PipelineBot)
[13:01:00] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1187854 (owner: PipelineBot)
[13:01:24] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2050.codfw.wmnet
[13:01:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2027.codfw.wmnet onto es2050.codfw.wmnet
[13:01:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002
[13:03:02] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002
[13:03:05] (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2025-09-09-171717 to 2025-09-16-190551 [deployment-charts] - https://gerrit.wikimedia.org/r/1189155 (https://phabricator.wikimedia.org/T399323)
[13:03:07] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-08-191243 to 2025-09-16-134119 [deployment-charts] - https://gerrit.wikimedia.org/r/1189156 (https://phabricator.wikimedia.org/T397956)
[13:03:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002
[13:03:30] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002
[13:03:58] (PS2) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576)
[13:04:01] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2027.codfw.wmnet onto es2050.codfw.wmnet
[13:04:23] (PS3) Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - https://gerrit.wikimedia.org/r/1183646
[13:04:46] (PS2) Jelto: ceph: add module to sync a bucket locally [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922)
[13:06:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:06:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:08:14] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6972/co" [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:08:41] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:09:29] RECOVERY - BFD status on lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:10:14] (PS3) Majavah: ceph: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188826
[13:10:17] (CR) Jelto: [V:+1] ceph: add module to sync a bucket locally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:10:40] (Merged) jenkins-bot: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[13:10:43] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:10:56] (CR) CI reject: [V:-1] ceph: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188826 (owner: Majavah)
[13:11:34] (PS4) Majavah: ceph: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188826
[13:12:41] (CR) Majavah: [C:+2] ceph: Drop buster support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1188826 (owner: Majavah)
[13:13:43] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:13:43] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:17:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1232 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83407 and previous config saved to /var/cache/conftool/dbconfig/20250917-131718-ladsgroup.json
[13:17:23] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:22:03] RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:23:45] RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:27:59] (CR) Arnaudb: [C:+1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:29:37] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool for cloning
[13:29:48] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool for cloning
[13:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:34:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[13:34:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es2027 T402859', diff saved to https://phabricator.wikimedia.org/P83408 and previous config saved to /var/cache/conftool/dbconfig/20250917-133454-fceratto.json
[13:34:59] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859
[13:35:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2027.codfw.wmnet onto es2050.codfw.wmnet
[13:35:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002
[13:36:55] (PS1) Muehlenhoff: Apply installserver role to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189169 (https://phabricator.wikimedia.org/T396487)
[13:36:57] (PS1) Muehlenhoff: Update DHCP server in eqiad to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189170 (https://phabricator.wikimedia.org/T396487)
[13:36:59] (PS1) Muehlenhoff: Update the proxies used by cloudcumin to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189171 (https://phabricator.wikimedia.org/T396487)
[13:37:42] (PS1) Muehlenhoff: Point webproxy in eqiad to install1005 [dns] - https://gerrit.wikimedia.org/r/1189173 (https://phabricator.wikimedia.org/T396487)
[13:38:04] (PS2) Jforrester: Enable Wikifunctions client mode on Wiktionaries, Part II [mediawiki-config] - https://gerrit.wikimedia.org/r/1172047 (https://phabricator.wikimedia.org/T397401)
[13:38:25] fceratto@cumin1002 clone_es (PID 2838644) is awaiting input
[13:38:27] (CR) CI reject: [V:-1] Enable Wikifunctions client mode on Wiktionaries, Part II [mediawiki-config] - https://gerrit.wikimedia.org/r/1172047 (https://phabricator.wikimedia.org/T397401) (owner: Jforrester)
[13:39:35] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11189472 (elukey) I retried again, "No media present" :(
[13:40:03] (PS2) Majavah: P:wmcs::metricsinfra: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188828
[13:40:07] elukey@cumin1003 reimage (PID 2969566) is awaiting input
[13:40:32] (CR) CI reject: [V:-1] P:wmcs::metricsinfra: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188828 (owner: Majavah)