[00:00:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P82799 and previous config saved to /var/cache/conftool/dbconfig/20250909-000014-ladsgroup.json [00:01:47] FIRING: [4x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:47] RESOLVED: [4x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:24] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:08:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1186105 [00:08:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1186105 (owner: 10TrainBranchBot) [00:10:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Upgrade db1158 to MariaDB 10.11 (T399955)', diff saved to https://phabricator.wikimedia.org/P82801 and previous config saved to /var/cache/conftool/dbconfig/20250909-001050-ladsgroup.json [00:10:54] T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955 [00:11:14] (03PS2) 10RLazarus: all charts: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) [00:11:14] (03PS1) 10RLazarus: mathoid: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186107 (https://phabricator.wikimedia.org/T403101) [00:11:38] !log ladsgroup@cumin1003 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1158.eqiad.wmnet with reason: Upgrade to 10.11 [00:12:43] (03PS2) 10Cwhite: mediawiki: enable forward of fatal metrics to statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049625 (https://phabricator.wikimedia.org/T356814) [00:12:59] (03PS2) 10RLazarus: mathoid: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186099 (https://phabricator.wikimedia.org/T403663) [00:13:20] (03PS1) 10Ladsgroup: db1158: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186108 (https://phabricator.wikimedia.org/T399955) [00:14:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T402925)', diff saved to https://phabricator.wikimedia.org/P82802 and previous config saved to /var/cache/conftool/dbconfig/20250909-001434-ladsgroup.json [00:14:39] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [00:14:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [00:14:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T402925)', diff saved to https://phabricator.wikimedia.org/P82803 and previous config saved to /var/cache/conftool/dbconfig/20250909-001457-ladsgroup.json [00:15:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T402925)', diff saved to https://phabricator.wikimedia.org/P82804 and previous config saved to /var/cache/conftool/dbconfig/20250909-001522-ladsgroup.json [00:15:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [00:16:09] (03CR) 10Ladsgroup: [C:03+2] db1158: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186108 (https://phabricator.wikimedia.org/T399955) (owner: 10Ladsgroup) 
[00:16:24] (03CR) 10Scott French: [C:03+1] mathoid: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186107 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [00:17:30] (03CR) 10Scott French: [C:03+1] mathoid: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186099 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [00:18:33] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11160755 (10Etonkovidova) [00:21:56] (03CR) 10RLazarus: [C:03+2] mathoid: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186107 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [00:23:28] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1158 gradually with 4 steps - Maint over [00:23:58] (03Merged) 10jenkins-bot: mathoid: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186107 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [00:24:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:26:00] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [00:26:07] (03CR) 10RLazarus: [C:03+2] mathoid: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186099 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [00:28:17] (03Merged) 10jenkins-bot: mathoid: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186099 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [00:28:26] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1190.eqiad.wmnet with reason: Maintenance 
[00:28:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T402763)', diff saved to https://phabricator.wikimedia.org/P82808 and previous config saved to /var/cache/conftool/dbconfig/20250909-002833-ladsgroup.json [00:28:37] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:29:03] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [00:29:27] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [00:29:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [00:30:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1186105 (owner: 10TrainBranchBot) [00:30:34] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mathoid: apply [00:31:07] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [00:31:13] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mathoid: apply [00:31:43] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [00:38:58] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [00:38:59] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11160824 (10Alchimista) I'm sorry @BCornwall, but as Waldir mentioned, when responding to your email, it didn't include your email, sorry for that. In 2022 we were aske... 
[00:39:10] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1223 gradually with 4 steps - Maint over [00:39:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:42:03] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [00:44:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402925)', diff saved to https://phabricator.wikimedia.org/P82811 and previous config saved to /var/cache/conftool/dbconfig/20250909-004437-ladsgroup.json [00:44:41] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [00:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:47:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [00:48:00] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11160842 (10RLazarus) [00:48:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [00:49:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402763)', diff saved to https://phabricator.wikimedia.org/P82812 and previous config saved to /var/cache/conftool/dbconfig/20250909-004931-ladsgroup.json [00:49:35] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:59:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2188', diff saved to https://phabricator.wikimedia.org/P82814 and previous config saved to /var/cache/conftool/dbconfig/20250909-005944-ladsgroup.json [01:00:54] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:04:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82815 and previous config saved to /var/cache/conftool/dbconfig/20250909-010439-ladsgroup.json [01:05:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [01:06:45] (03Merged) 10jenkins-bot: tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [01:07:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.18 [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186122 (https://phabricator.wikimedia.org/T396379) [01:07:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.18 [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186122 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [01:08:55] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1158 gradually with 4 steps - Maint over [01:10:13] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036 (10RLazarus) 03NEW [01:13:05] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 10s) 
[01:13:14] RECOVERY - dump of x1 in codfw on backupmon1001 is OK: Last dump for x1 at codfw (db2197) taken on 2025-09-09 00:23:45 (97 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:13:17] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1182228|tests: Add test for wmfApplyEtcdDBConfig()]] [01:14:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P82817 and previous config saved to /var/cache/conftool/dbconfig/20250909-011452-ladsgroup.json [01:18:14] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11160892 (10RLazarus) Note the `event_log_path` comes up in `mesh.configuration._tcp_cluster`, which pulls in the entire `health_checks` field from values.yaml, so that's where the `event_log_path`s are... [01:19:11] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1182228|tests: Add test for wmfApplyEtcdDBConfig()]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:19:46] Huh, normally those short-circuit saying "test/labs only" [01:19:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82818 and previous config saved to /var/cache/conftool/dbconfig/20250909-011950-ladsgroup.json [01:21:13] https://gitlab.wikimedia.org/repos/releng/scap/-/blob/34710433437da7276124168a3f713a57902b3cc9/scap/backport.py#L417 [01:21:14] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.18 [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186122 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [01:21:24] !log krinkle@deploy1003 Sync cancelled. 
[01:21:39] Looks like we have a beta filter, not a test/* filter [01:30:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402925)', diff saved to https://phabricator.wikimedia.org/P82819 and previous config saved to /var/cache/conftool/dbconfig/20250909-013000-ladsgroup.json [01:30:05] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [01:30:16] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [01:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:34:31] I'm leaving this merged, the same way as if it was merged and then git-pulled on deploy1003. Not great, but not worth reverting or deploying right now since it is a no-op. 
[01:34:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402763)', diff saved to https://phabricator.wikimedia.org/P82820 and previous config saved to /var/cache/conftool/dbconfig/20250909-013458-ladsgroup.json [01:35:03] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [01:35:14] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1199.eqiad.wmnet with reason: Maintenance [01:35:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T402763)', diff saved to https://phabricator.wikimedia.org/P82821 and previous config saved to /var/cache/conftool/dbconfig/20250909-013521-ladsgroup.json [01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:39:20] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11160928 (10RLazarus) The global_downstream_max_connections was deprecated in the 1.28 release notes, but as of 1.29, the downstream connections resource_monitor was still a work in progress. So we won't... 
[01:39:50] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11160929 (10RLazarus) [01:40:56] RECOVERY - dump of x1 in eqiad on backupmon1001 is OK: Last dump for x1 at eqiad (db1216) taken on 2025-09-09 00:00:07 (97 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:43:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:46:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:49:59] 06SRE, 06Release-Engineering-Team: docker-registry will show different last updated time as you refresh the page... 
- https://phabricator.wikimedia.org/T404011#11160935 (10Reedy) [01:56:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402763)', diff saved to https://phabricator.wikimedia.org/P82822 and previous config saved to /var/cache/conftool/dbconfig/20250909-015658-ladsgroup.json [01:57:03] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [01:59:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2203.codfw.wmnet with reason: Maintenance [01:59:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T402925)', diff saved to https://phabricator.wikimedia.org/P82823 and previous config saved to /var/cache/conftool/dbconfig/20250909-015943-ladsgroup.json [01:59:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0200) [02:08:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11160972 (10Papaul) @ayounsi @cmooney can you do the test Juniper asked us to do tomorrow Sept. 9th after the meeting link around 11:15am CT?... 
[02:12:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P82824 and previous config saved to /var/cache/conftool/dbconfig/20250909-021206-ladsgroup.json [02:25:28] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:26:28] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [02:27:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P82825 and previous config saved to /var/cache/conftool/dbconfig/20250909-022713-ladsgroup.json [02:27:18] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [02:27:18] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:28:57] FIRING: [5x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T402925)', diff saved to https://phabricator.wikimedia.org/P82826 and previous config saved to /var/cache/conftool/dbconfig/20250909-023146-ladsgroup.json [02:31:51] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [02:42:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402763)', diff saved to https://phabricator.wikimedia.org/P82827 and previous config saved to /var/cache/conftool/dbconfig/20250909-024221-ladsgroup.json [02:42:26] 
T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [02:42:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1221.eqiad.wmnet with reason: Maintenance [02:42:56] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:43:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T402763)', diff saved to https://phabricator.wikimedia.org/P82828 and previous config saved to /var/cache/conftool/dbconfig/20250909-024303-ladsgroup.json [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0300) [03:02:01] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186138 (https://phabricator.wikimedia.org/T396379) [03:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186138 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [03:02:54] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186138 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [03:03:21] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.18 refs T396379 [03:03:24] T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379 [03:03:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T402763)', diff saved to https://phabricator.wikimedia.org/P82829 and previous config saved to /var/cache/conftool/dbconfig/20250909-030350-ladsgroup.json [03:03:55] T402763: Drop 
rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [03:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:17:24] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [03:17:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T402925)', diff saved to https://phabricator.wikimedia.org/P82830 and previous config saved to /var/cache/conftool/dbconfig/20250909-031731-ladsgroup.json [03:17:38] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [03:18:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82831 and previous config saved to /var/cache/conftool/dbconfig/20250909-031858-ladsgroup.json [03:34:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82832 and previous config saved to /var/cache/conftool/dbconfig/20250909-033405-ladsgroup.json [03:35:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1251.eqiad.wmnet with reason: Maintenance [03:36:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T402925)', diff saved to https://phabricator.wikimedia.org/P82833 and previous config saved to /var/cache/conftool/dbconfig/20250909-033605-ladsgroup.json [03:36:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [03:39:52] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T402925)', diff saved to https://phabricator.wikimedia.org/P82834 and previous config saved to /var/cache/conftool/dbconfig/20250909-034740-ladsgroup.json [03:47:45] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [03:49:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T402763)', diff saved to https://phabricator.wikimedia.org/P82835 and previous config saved to /var/cache/conftool/dbconfig/20250909-034913-ladsgroup.json [03:49:17] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [03:49:18] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1238.eqiad.wmnet with reason: Maintenance [03:49:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T402763)', diff saved to https://phabricator.wikimedia.org/P82836 and previous config saved to /var/cache/conftool/dbconfig/20250909-034924-ladsgroup.json [03:56:02] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.18 refs T396379 (duration: 52m 41s) [03:56:06] T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0400) [04:01:13] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.15 (duration: 01m 04s) [04:02:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82837 and previous config saved to /var/cache/conftool/dbconfig/20250909-040247-ladsgroup.json [04:06:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T402925)', diff saved to https://phabricator.wikimedia.org/P82838 and previous config saved to /var/cache/conftool/dbconfig/20250909-040632-ladsgroup.json [04:06:37] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [04:08:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T402763)', diff saved to https://phabricator.wikimedia.org/P82839 and previous config saved to /var/cache/conftool/dbconfig/20250909-040802-ladsgroup.json [04:08:07] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [04:10:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:17:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82840 and previous config saved to /var/cache/conftool/dbconfig/20250909-041755-ladsgroup.json [04:21:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P82841 and previous config saved to /var/cache/conftool/dbconfig/20250909-042140-ladsgroup.json [04:23:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P82842 and previous config saved to /var/cache/conftool/dbconfig/20250909-042310-ladsgroup.json [04:33:03] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db2216 (T402925)', diff saved to https://phabricator.wikimedia.org/P82843 and previous config saved to /var/cache/conftool/dbconfig/20250909-043302-ladsgroup.json [04:33:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [04:36:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P82844 and previous config saved to /var/cache/conftool/dbconfig/20250909-043647-ladsgroup.json [04:38:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P82845 and previous config saved to /var/cache/conftool/dbconfig/20250909-043818-ladsgroup.json [04:42:46] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7179.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:51:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T402925)', diff saved to https://phabricator.wikimedia.org/P82846 and previous config saved to /var/cache/conftool/dbconfig/20250909-045155-ladsgroup.json [04:51:59] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [04:52:11] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:53:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T402763)', diff saved to https://phabricator.wikimedia.org/P82847 and previous config saved to /var/cache/conftool/dbconfig/20250909-045325-ladsgroup.json [04:53:30] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [04:53:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on db1241.eqiad.wmnet with reason: Maintenance [04:53:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T402763)', diff saved to https://phabricator.wikimedia.org/P82848 and previous config saved to /var/cache/conftool/dbconfig/20250909-045348-ladsgroup.json [04:56:26] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [05:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:06:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [05:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T402763)', diff saved to https://phabricator.wikimedia.org/P82849 and previous config saved to /var/cache/conftool/dbconfig/20250909-051201-ladsgroup.json [05:12:05] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [05:20:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:21:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [05:21:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [05:21:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82850 and previous config saved to /var/cache/conftool/dbconfig/20250909-052142-ladsgroup.json [05:21:47] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [05:22:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82851 and previous config saved to /var/cache/conftool/dbconfig/20250909-052251-ladsgroup.json [05:26:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:27:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P82852 and previous config saved to /var/cache/conftool/dbconfig/20250909-052708-ladsgroup.json [05:29:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:31:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.759 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:27] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 7063 [05:34:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7063 [05:36:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:37:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:38:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82853 and previous config saved to /var/cache/conftool/dbconfig/20250909-053759-ladsgroup.json [05:39:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:41:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 
(Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:42:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:42:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P82854 and previous config saved to /var/cache/conftool/dbconfig/20250909-054216-ladsgroup.json [05:46:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:46:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:53:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82855 and previous config saved to /var/cache/conftool/dbconfig/20250909-055307-ladsgroup.json [05:57:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T402763)', diff saved to https://phabricator.wikimedia.org/P82856 and previous config saved to /var/cache/conftool/dbconfig/20250909-055724-ladsgroup.json [05:57:28] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [05:57:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1242.eqiad.wmnet with reason: Maintenance [05:57:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T402763)', diff saved to https://phabricator.wikimedia.org/P82857 and previous config saved to /var/cache/conftool/dbconfig/20250909-055748-ladsgroup.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0600) [06:00:04] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0600). 
[06:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:08:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82858 and previous config saved to /var/cache/conftool/dbconfig/20250909-060814-ladsgroup.json [06:08:19] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [06:08:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [06:08:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T402925)', diff saved to https://phabricator.wikimedia.org/P82859 and previous config saved to /var/cache/conftool/dbconfig/20250909-060837-ladsgroup.json [06:09:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402925)', diff saved to https://phabricator.wikimedia.org/P82860 and previous config saved to /var/cache/conftool/dbconfig/20250909-060947-ladsgroup.json [06:17:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T402763)', diff saved to https://phabricator.wikimedia.org/P82861 and previous config saved to /var/cache/conftool/dbconfig/20250909-061706-ladsgroup.json [06:17:11] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [06:18:41] (03CR) 10Brouberol: [C:03+1] opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [06:21:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.403 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:24:55] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82862 and previous config saved to /var/cache/conftool/dbconfig/20250909-062454-ladsgroup.json [06:28:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:32:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P82863 and previous config saved to /var/cache/conftool/dbconfig/20250909-063213-ladsgroup.json [06:34:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:36:32] (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in esams to install3004 [dns] - 10https://gerrit.wikimedia.org/r/1185918 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [06:36:38] !log jmm@dns1004 START - running authdns-update [06:37:41] !log jmm@dns1004 END - running authdns-update [06:38:54] !log jmm@dns1004 START - running authdns-update [06:39:53] !log jmm@dns1004 END - running authdns-update [06:40:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82864 and previous config saved to /var/cache/conftool/dbconfig/20250909-064002-ladsgroup.json [06:41:34] (03CR) 10Muehlenhoff: [C:03+2] Update DHCP server in esams [puppet] - 10https://gerrit.wikimedia.org/r/1185941 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [06:41:51] (03CR) 10Muehlenhoff: [C:03+2] Update DHCP server in esams [homer/public] - 10https://gerrit.wikimedia.org/r/1185940 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [06:43:11] 06SRE, 06Infrastructure-Foundations: 
Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11161122 (10MoritzMuehlenhoff) [06:44:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:45:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161125 (10MoritzMuehlenhoff) [06:47:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P82865 and previous config saved to /var/cache/conftool/dbconfig/20250909-064721-ladsgroup.json [06:53:31] (03PS3) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [06:53:40] (03CR) 10CI reject: [V:04-1] Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [06:54:12] (03PS4) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [06:54:22] (03CR) 10CI reject: [V:04-1] Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [06:55:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402925)', diff saved to https://phabricator.wikimedia.org/P82866 and previous config saved to /var/cache/conftool/dbconfig/20250909-065509-ladsgroup.json [06:55:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [06:55:26] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2153.codfw.wmnet with reason: Maintenance [06:55:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82867 and previous config saved to /var/cache/conftool/dbconfig/20250909-065533-ladsgroup.json [06:56:41] (03PS6) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [06:56:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82868 and previous config saved to /var/cache/conftool/dbconfig/20250909-065643-ladsgroup.json [06:57:32] (03CR) 10CI reject: [V:04-1] Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [06:58:09] (03PS7) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [06:58:56] (03CR) 10A smart kitten: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [06:59:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:04] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:02:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T402763)', diff saved to https://phabricator.wikimedia.org/P82869 and previous config saved to /var/cache/conftool/dbconfig/20250909-070228-ladsgroup.json [07:02:33] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [07:02:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1243.eqiad.wmnet with reason: Maintenance [07:02:50] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:02:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T402763)', diff saved to https://phabricator.wikimedia.org/P82870 and previous config saved to /var/cache/conftool/dbconfig/20250909-070251-ladsgroup.json [07:04:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.651 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.038 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:36] (03PS8) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [07:07:43] (03CR) 10CI reject: [V:04-1] Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [07:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - 
https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:11:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82871 and previous config saved to /var/cache/conftool/dbconfig/20250909-071150-ladsgroup.json [07:12:50] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T402763)', diff saved to https://phabricator.wikimedia.org/P82872 and previous config saved to /var/cache/conftool/dbconfig/20250909-072230-ladsgroup.json [07:22:34] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [07:26:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82873 and previous config saved to /var/cache/conftool/dbconfig/20250909-072658-ladsgroup.json [07:27:38] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:30:45] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11161194 (10MoritzMuehlenhoff) [07:30:49] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11161199 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done! [07:32:18] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1182849 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [07:37:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P82874 and previous config saved to /var/cache/conftool/dbconfig/20250909-073737-ladsgroup.json [07:38:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:30] (03CR) 10Vgutierrez: [C:03+2] benthos webrequest: Add ja3n X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1182849 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [07:42:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82875 and previous config saved to /var/cache/conftool/dbconfig/20250909-074205-ladsgroup.json [07:42:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [07:42:22] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [07:42:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82876 and previous config saved to /var/cache/conftool/dbconfig/20250909-074229-ladsgroup.json [07:48:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[07:49:55] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [07:51:26] (03PS1) 10Brouberol: airflow-dev: fix the task logs location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186425 [07:52:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P82877 and previous config saved to /var/cache/conftool/dbconfig/20250909-075245-ladsgroup.json [07:57:34] (03CR) 10Muehlenhoff: [C:03+2] Apply config to enable new Bird release on the role/esams level [puppet] - 10https://gerrit.wikimedia.org/r/1185938 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:59:41] (03CR) 10Stevemunene: [C:03+1] "Lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186425 (owner: 10Brouberol) [08:00:56] (03CR) 10Brouberol: [C:03+2] airflow-dev: fix the task logs location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186425 (owner: 10Brouberol) [08:01:32] !log push pfw policy - T403972 [08:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:07:23] (03PS1) 10Vgutierrez: turnilo webrequest: Add JA3N X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1186430 (https://phabricator.wikimedia.org/T400270) [08:07:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T402763)', diff saved to https://phabricator.wikimedia.org/P82878 and previous config saved to 
/var/cache/conftool/dbconfig/20250909-080752-ladsgroup.json [08:07:57] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [08:08:02] (03PS1) 10Ilias Sarantopoulos: ml-services: enable GPU for tone check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186431 (https://phabricator.wikimedia.org/T403378) [08:08:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1244.eqiad.wmnet with reason: Maintenance [08:08:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T402763)', diff saved to https://phabricator.wikimedia.org/P82879 and previous config saved to /var/cache/conftool/dbconfig/20250909-080815-ladsgroup.json [08:11:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:11:29] (03CR) 10AikoChou: [C:03+1] ml-services: enable GPU for tone check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186431 (https://phabricator.wikimedia.org/T403378) (owner: 10Ilias Sarantopoulos) [08:13:54] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186431 (https://phabricator.wikimedia.org/T403378) (owner: 10Ilias Sarantopoulos) [08:16:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:21:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:22:14] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable GPU for tone check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186431 (https://phabricator.wikimedia.org/T403378) (owner: 10Ilias Sarantopoulos) [08:23:50] (03Merged) 10jenkins-bot: ml-services: enable GPU for tone check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186431 (https://phabricator.wikimedia.org/T403378) (owner: 10Ilias Sarantopoulos) [08:27:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:28:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T402763)', diff saved to https://phabricator.wikimedia.org/P82880 
and previous config saved to /var/cache/conftool/dbconfig/20250909-082842-ladsgroup.json [08:28:47] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [08:30:29] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [08:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:32:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:32:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-redacteddb1001.eqiad.wmnet [08:33:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161421 (10MoritzMuehlenhoff) [08:33:55] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1182695 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [08:36:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install3003.wikimedia.org [08:38:35] (03PS14) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:41:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:42:10] (03PS1) 10Muehlenhoff: Remove Ganeti role from ganeti3008 [puppet] - 10https://gerrit.wikimedia.org/r/1186438 
(https://phabricator.wikimedia.org/T402259) [08:42:12] (03PS1) 10Muehlenhoff: Remove ganeti02/esams from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1186439 (https://phabricator.wikimedia.org/T402259) [08:42:17] FIRING: ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:43:10] (03PS2) 10Muehlenhoff: Remove ganeti02/esams from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1186439 (https://phabricator.wikimedia.org/T402259) [08:43:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P82881 and previous config saved to /var/cache/conftool/dbconfig/20250909-084350-ladsgroup.json [08:45:02] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-redacteddb1001.eqiad.wmnet [08:45:16] PROBLEM - MariaDB Replica IO: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:16] PROBLEM - MariaDB Replica IO: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:16] PROBLEM - MariaDB Replica IO: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:24] PROBLEM - MariaDB Replica IO: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:24] PROBLEM - MariaDB 
Replica IO: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica Lag: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica Lag: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica Lag: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica Lag: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica SQL: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:26] PROBLEM - MariaDB Replica SQL: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:27] PROBLEM - MariaDB Replica SQL: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:27] PROBLEM - MariaDB Replica SQL: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:28] PROBLEM - MariaDB Replica SQL: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:28] PROBLEM - MariaDB Replica SQL: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:30] PROBLEM - MariaDB Replica SQL: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:30] PROBLEM - MariaDB Replica SQL: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:30] PROBLEM - MariaDB read only s2 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:30] PROBLEM - MariaDB read only s1 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:31] PROBLEM - MariaDB read only s4 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:31] PROBLEM - MariaDB read only s3 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:32] PROBLEM - MariaDB read only s5 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:32] PROBLEM - MariaDB read only s6 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:33] PROBLEM - MariaDB read only s7 on an-redacteddb1001 is CRITICAL: 
Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:33] PROBLEM - MariaDB read only s8 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:34] PROBLEM - MariaDB read only wikireplica-s2 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:34] PROBLEM - MariaDB read only wikireplica-s3 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:35] PROBLEM - MariaDB read only wikireplica-s1 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:35] PROBLEM - MariaDB read only wikireplica-s4 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:36] PROBLEM - MariaDB read only wikireplica-s5 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:36] PROBLEM - MariaDB read only wikireplica-s6 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:37] PROBLEM - MariaDB read only wikireplica-s7 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:37] PROBLEM - MariaDB read only wikireplica-s8 on an-redacteddb1001 is CRITICAL: Could not connect to localhost:3318 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:38] PROBLEM - mysqld processes on an-redacteddb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:46:49] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:47:17] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:33] SRE-tools, Infrastructure-Foundations, netops: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712#11161490 (ayounsi) Open→Resolved a:ayounsi Evaluation is done and @jhathaway has rolled out UUID + MAC fallback DHCP (with the `--no82` c... [08:49:54] jmm@cumin2002 decommission (PID 1316756) is awaiting input [08:50:20] PROBLEM - MariaDB Replica IO: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:50:20] PROBLEM - MariaDB Replica Lag: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:50:30] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [08:51:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:51:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install3003.wikimedia.org [08:51:41] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161523 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install3003.wikimedia.org` - install3003.wikimedia.org (**PA... [08:54:34] SRE, Release-Engineering-Team, serviceops: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11161533 (LSobanski) [08:54:43] SRE, Release-Engineering-Team, serviceops: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11161536 (LSobanski) [08:54:51] SRE, Release-Engineering-Team, serviceops: docker-registry will show different last updated time as you refresh the page... 
- https://phabricator.wikimedia.org/T404011#11161537 (LSobanski) [08:55:20] PROBLEM - MariaDB Replica IO: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:32] (CR) MVernon: [C:+2] sretest2010: set to be installed like a new ms-be* node [puppet] - https://gerrit.wikimedia.org/r/1185973 (https://phabricator.wikimedia.org/T394357) (owner: MVernon) [08:56:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:56:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [08:58:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P82882 and previous config saved to /var/cache/conftool/dbconfig/20250909-085858-ladsgroup.json [09:00:20] PROBLEM - MariaDB Replica IO: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:01:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:05:20] PROBLEM - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:10:20] PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:12:31] !log installing openssh security updates on Bullseye [09:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T402763)', diff saved to https://phabricator.wikimedia.org/P82883 and previous config saved to /var/cache/conftool/dbconfig/20250909-091405-ladsgroup.json [09:14:10] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [09:14:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:15:20] PROBLEM - MariaDB Replica Lag: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:16:53] (CR) Muehlenhoff: [C:+2] Remove ganeti02/esams from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1186439 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff) [09:18:14] (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1186446 (https://phabricator.wikimedia.org/T404050) [09:19:17] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: 
Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161627 (MoritzMuehlenhoff) [09:19:29] (PS1) Ilias Sarantopoulos: ml-services: enable GPU for edit-check in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1186447 (https://phabricator.wikimedia.org/T403378) [09:23:42] (CR) Muehlenhoff: [C:+2] Remove Ganeti role from ganeti3008 [puppet] - https://gerrit.wikimedia.org/r/1186438 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff) [09:23:59] (PS1) Ilias Sarantopoulos: ml-services: enable GPU for edit-check in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1186448 (https://phabricator.wikimedia.org/T403378) [09:24:34] SRE, Infrastructure-Foundations, Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11161641 (MoritzMuehlenhoff) [09:28:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:28:46] (CR) Joal: [C:+1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/1186430 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez) [09:29:02] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db2170.codfw.wmnet [09:29:11] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db2170 - Upgrading db2170.codfw.wmnet [09:29:18] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2170 - Upgrading db2170.codfw.wmnet [09:29:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bullseye [09:29:38] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11161655 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bullseye [09:31:19] (CR) Vgutierrez: [C:+2] turnilo webrequest: Add JA3N X-Analytics sub-field [puppet] - https://gerrit.wikimedia.org/r/1186430 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez) [09:32:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1247.eqiad.wmnet with reason: Maintenance [09:33:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T402763)', diff saved to https://phabricator.wikimedia.org/P82889 and previous config saved to /var/cache/conftool/dbconfig/20250909-093259-ladsgroup.json [09:33:03] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [09:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:36:05] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2170.codfw.wmnet [09:36:36] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db2170* gradually with 4 steps - Work done [09:41:38] (CR) Clément Goubert: [C:-1] "The steps to remove the control plane from production [1] and as an etcd node [2] need to be followed prior to merging or running the deco" [puppet] - https://gerrit.wikimedia.org/r/1186006 (https://phabricator.wikimedia.org/T383227) (owner: Jasmine) [09:41:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:42:08] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db2170* gradually with 4 steps - Work done [09:43:07] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db2170* gradually with 4 steps - Work done [09:45:48] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti3008.esams.wmnet [09:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:47:28] SRE, Ganeti, 
Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161726 (MoritzMuehlenhoff) [09:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:34] (CR) Hnowlan: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6880/console" [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse) [09:50:40] SRE, Infrastructure-Foundations, netops: Ganeti network config results in additional auto-conf IPv6 address - https://phabricator.wikimedia.org/T378335#11161728 (cmooney) Open→Declined Gonna close this one. I suspect we may be hitting an occasional issue due to this, which is being paper... 
[09:51:07] (CR) Hnowlan: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6881/console" [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse) [09:51:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T402763)', diff saved to https://phabricator.wikimedia.org/P82891 and previous config saved to /var/cache/conftool/dbconfig/20250909-095158-ladsgroup.json [09:52:02] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [09:52:04] !log restarting turnilo [09:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:20] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1247 gradually with 4 steps - Maint over [09:54:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [09:54:52] jouncebot: nowandnext [09:54:52] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [09:54:52] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1000) [09:55:57] (CR) Hnowlan: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6882/co" [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse) [09:56:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T404050 [09:56:59] T404050: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T404050 [09:57:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet 
with reason: host reimage [09:58:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T404050', diff saved to https://phabricator.wikimedia.org/P82893 and previous config saved to /var/cache/conftool/dbconfig/20250909-095829-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1000) [10:00:14] SRE, Infrastructure-Foundations, netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#11161778 (ayounsi) Open→Resolved a:ayounsi cloudsw2-d5-eqiad is now gone. [10:01:15] (CR) Hnowlan: [V:+1 C:+1] alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse) [10:02:42] (CR) Federico Ceratto: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1186446 (https://phabricator.wikimedia.org/T404050) (owner: Gerrit maintenance bot) [10:04:06] SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11161791 (elukey) @RLazarus Hi! I found this task while trying to remove all the occurrences of `1.23.10-2-s4-20231203` still running on the k8s clusters. Are you going t... 
[10:04:24] !log Starting s4 codfw failover from db2240 to db2179 - T404050 [10:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:28] T404050: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T404050 [10:05:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T404050', diff saved to https://phabricator.wikimedia.org/P82896 and previous config saved to /var/cache/conftool/dbconfig/20250909-100534-fceratto.json [10:07:44] jmm@cumin2002 upgrade-firmware (PID 1352472) is awaiting input [10:09:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2240 API T404050', diff saved to https://phabricator.wikimedia.org/P82899 and previous config saved to /var/cache/conftool/dbconfig/20250909-100925-fceratto.json [10:09:30] T404050: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T404050 [10:09:54] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11161809 (ayounsi) [10:10:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T404027 [10:10:51] T404027: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T404027 [10:11:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1209 with weight 0 T404027', diff saved to https://phabricator.wikimedia.org/P82900 and previous config saved to /var/cache/conftool/dbconfig/20250909-101103-ladsgroup.json [10:11:47] SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11161822 (elukey) I noticed the following occurrences of Buster images: dse-k8s-eqiad-152-namespace: datasets-config dse-k8s-eqiad-157-namespace: datasets-config-next e... 
[10:12:29] SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#11161823 (ayounsi) Open→Resolved a:ayounsi Closing that never-ending tracking task to focus on more specific sub-tasks now that all the ground work is done. [10:12:59] SRE, Infrastructure-Foundations, netops: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530#11161826 (cmooney) Open→Resolved Closing this one, current status is both the Juniper & Nokia device definitions are the same, and... [10:13:27] jmm@cumin2002 upgrade-firmware (PID 1352472) is awaiting input [10:15:03] (CR) Ladsgroup: [C:+2] mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186089 (https://phabricator.wikimedia.org/T404027) (owner: Gerrit maintenance bot) [10:15:04] ops-eqsin, SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#11161831 (ayounsi) a:RobH The physical anchor has been replaced by a VM, moving that task to DCops to recycle the failed hardware : https://netbox.wikimedia.org/dcim/devices/1287/ [10:15:09] (PS2) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186089 (https://phabricator.wikimedia.org/T404027) [10:15:10] FIRING: [2x] GanetiBGPDown: BGP session down between ganeti3008 and asw1-bw27-esams - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [10:15:10] (CR) Ladsgroup: [C:+2] mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186089 (https://phabricator.wikimedia.org/T404027) (owner: Gerrit maintenance bot) [10:15:15] (CR) Ladsgroup: [V:+2 C:+2] mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186089 
(https://phabricator.wikimedia.org/T404027) (owner: Gerrit maintenance bot) [10:15:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [10:18:07] !log Starting s8 eqiad failover from db1193 to db1209 - T404027 [10:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:11] T404027: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T404027 [10:18:11] (PS1) Elukey: sre.hsots.provision: add sys-121c-tn2r-configg as special case [cookbooks] - https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) [10:18:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T404027', diff saved to https://phabricator.wikimedia.org/P82904 and previous config saved to /var/cache/conftool/dbconfig/20250909-101828-ladsgroup.json [10:19:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1209 to s8 primary and set section read-write T404027', diff saved to https://phabricator.wikimedia.org/P82906 and previous config saved to /var/cache/conftool/dbconfig/20250909-101945-ladsgroup.json [10:22:19] (PS2) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1186090 (https://phabricator.wikimedia.org/T404027) [10:22:21] (CR) Ladsgroup: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1186090 (https://phabricator.wikimedia.org/T404027) (owner: Gerrit maintenance bot) [10:22:23] (CR) Ladsgroup: [V:+2 C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1186090 (https://phabricator.wikimedia.org/T404027) (owner: Gerrit maintenance bot) [10:22:40] !log ladsgroup@dns1004 START - running authdns-update [10:23:45] !log ladsgroup@dns1004 END - running 
authdns-update [10:25:02] (CR) Bartosz Wójtowicz: [C:+1] "LGTM, thank you for deploying those!" [deployment-charts] - https://gerrit.wikimedia.org/r/1186447 (https://phabricator.wikimedia.org/T403378) (owner: Ilias Sarantopoulos) [10:25:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [10:25:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T401906)', diff saved to https://phabricator.wikimedia.org/P82911 and previous config saved to /var/cache/conftool/dbconfig/20250909-102554-fceratto.json [10:25:58] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:26:06] jouncebot: nowandnext [10:26:06] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1000) [10:26:06] In 1 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1200) [10:26:41] Anyone mind if I backport to the wmf branches? 
[10:26:53] (PS1) Dreamy Jazz: Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186457 (https://phabricator.wikimedia.org/T403111) [10:27:03] (PS1) Dreamy Jazz: Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186458 (https://phabricator.wikimedia.org/T403111) [10:27:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T401906)', diff saved to https://phabricator.wikimedia.org/P82912 and previous config saved to /var/cache/conftool/dbconfig/20250909-102704-fceratto.json [10:27:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1193 T404027', diff saved to https://phabricator.wikimedia.org/P82913 and previous config saved to /var/cache/conftool/dbconfig/20250909-102709-ladsgroup.json [10:27:13] T404027: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T404027 [10:27:40] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1247 gradually with 4 steps - Maint over [10:27:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [10:27:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti3008.esams.wmnet [10:28:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: Glow up [10:28:53] (CR) Dreamy Jazz: [C:+2] Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186458 (https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:28:56] (CR) Dreamy Jazz: [C:+2] Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] 
(wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186457 (https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:28:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:05] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2170* gradually with 4 steps - Work done [10:30:23] (PS1) Dreamy Jazz: Add the CheckUserSuggestedInvestigationsGetSignals hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186460 (https://phabricator.wikimedia.org/T403111) [10:30:37] (CR) Dreamy Jazz: [C:+2] Add the CheckUserSuggestedInvestigationsGetSignals hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186460 (https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:30:50] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db1193.eqiad.wmnet [10:30:57] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1193 - Upgrading db1193.eqiad.wmnet [10:31:04] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1193 - Upgrading db1193.eqiad.wmnet [10:31:08] SRE, Infrastructure-Foundations, netops: Allow read-only users to view logs on Juniper devices - https://phabricator.wikimedia.org/T401378#11161907 (cmooney) Open→Resolved [10:31:39] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1186461 [10:32:31] SRE, DC-Ops, Infrastructure-Foundations, netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11161919 (cmooney) >>! In T400783#11107455, @Jclark-ctr wrote: > @cmooney @ayounsi It looks like there’s nothing I or Juniper can do unless the OS is updated. 
A reboo... [10:33:21] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:33:22] (PS3) Dreamy Jazz: Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186459 (https://phabricator.wikimedia.org/T403959) [10:33:36] (CR) Dreamy Jazz: [C:+2] Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186459 (https://phabricator.wikimedia.org/T403959) (owner: Dreamy Jazz) [10:34:37] (CR) Ilias Sarantopoulos: [C:+2] ml-services: enable GPU for edit-check in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1186447 (https://phabricator.wikimedia.org/T403378) (owner: Ilias Sarantopoulos) [10:34:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS bullseye [10:34:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186460 (https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:34:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186459 (https://phabricator.wikimedia.org/T403959) (owner: Dreamy Jazz) [10:34:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1186458 (https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:34:50] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186457 
(https://phabricator.wikimedia.org/T403111) (owner: Dreamy Jazz) [10:34:55] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11161936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bullseye completed: - sretest2010 (*... [10:35:18] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:36:19] (Merged) jenkins-bot: ml-services: enable GPU for edit-check in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1186447 (https://phabricator.wikimedia.org/T403378) (owner: Ilias Sarantopoulos) [10:36:33] RECOVERY - MariaDB read only s8 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 4s, read_only: True, event_scheduler: False, 11.24 QPS, connection latency: 0.028318s, query latency: 0.012032s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:33] RECOVERY - MariaDB read only s2 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 30.67 QPS, connection latency: 0.024999s, query latency: 0.001116s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:33] RECOVERY - MariaDB read only s4 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 15s, read_only: True, event_scheduler: False, 22.35 QPS, connection latency: 0.020473s, query latency: 0.000691s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:33] RECOVERY - MariaDB read only s5 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 12s, read_only: True, event_scheduler: False, 22.41 QPS, connection latency: 0.029109s, query latency: 0.003364s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:33] RECOVERY - MariaDB read only wikireplica-s2 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 28.99 QPS, connection latency: 0.028240s, query latency: 0.000875s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:34] RECOVERY - MariaDB read only s1 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 24s, read_only: True, event_scheduler: False, 32.77 QPS, connection latency: 0.022258s, query latency: 0.000760s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:34] RECOVERY - MariaDB read only wikireplica-s1 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 24s, read_only: True, event_scheduler: False, 30.74 QPS, connection latency: 0.039727s, query latency: 0.000859s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:35] RECOVERY - MariaDB read only wikireplica-s3 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 20s, read_only: True, event_scheduler: False, 31.29 QPS, connection latency: 0.031561s, query latency: 0.002078s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:35] RECOVERY - MariaDB read only s6 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 9s, read_only: True, event_scheduler: False, 11.28 QPS, connection latency: 0.020997s, query latency: 0.000718s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:36] RECOVERY - MariaDB read only s7 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 7s, read_only: True, event_scheduler: False, 10.96 QPS, connection latency: 0.029791s, query latency: 0.015453s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:36] 
RECOVERY - MariaDB read only wikireplica-s5 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 12s, read_only: True, event_scheduler: False, 22.42 QPS, connection latency: 0.033639s, query latency: 0.000853s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:37] RECOVERY - MariaDB read only wikireplica-s4 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 16s, read_only: True, event_scheduler: False, 22.43 QPS, connection latency: 0.024185s, query latency: 0.000762s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:37] RECOVERY - MariaDB read only s3 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 21s, read_only: True, event_scheduler: False, 30.85 QPS, connection latency: 0.024150s, query latency: 0.000759s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3008.esams.wmnet with OS bookworm [10:36:41] RECOVERY - mysqld processes on an-redacteddb1001 is OK: PROCS OK: 8 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:36:41] RECOVERY - MariaDB read only wikireplica-s8 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 12s, read_only: True, event_scheduler: False, 19.87 QPS, connection latency: 0.019811s, query latency: 0.001019s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:41] RECOVERY - MariaDB read only wikireplica-s7 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 15s, read_only: True, event_scheduler: False, 19.85 QPS, connection latency: 0.037726s, query latency: 0.003089s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:41] RECOVERY - MariaDB read only wikireplica-s6 on an-redacteddb1001 is OK: Version 10.11.11-MariaDB, Uptime 17s, 
read_only: True, event_scheduler: False, 11.18 QPS, connection latency: 0.047525s, query latency: 0.001452s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:37:17] RECOVERY - MariaDB Replica IO: s1 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:17] RECOVERY - MariaDB Replica IO: s3 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:17] RECOVERY - MariaDB Replica IO: s2 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:23] RECOVERY - MariaDB Replica IO: s4 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:23] RECOVERY - MariaDB Replica IO: s6 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:23] RECOVERY - MariaDB Replica IO: s8 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:23] RECOVERY - MariaDB Replica IO: s7 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:23] RECOVERY - MariaDB Replica IO: s5 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica SQL: s4 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica 
SQL: s1 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica SQL: s3 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica SQL: s2 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica SQL: s6 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:27] RECOVERY - MariaDB Replica SQL: s5 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:33] RECOVERY - MariaDB Replica SQL: s7 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:33] RECOVERY - MariaDB Replica SQL: s8 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:39] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[10:38:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2010.codfw.wmnet [10:38:25] RECOVERY - MariaDB Replica Lag: s5 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:38:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1193.eqiad.wmnet [10:38:44] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#11161969 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing that parent task to focus on the remaining sub... [10:39:06] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:39:25] RECOVERY - MariaDB Replica Lag: s6 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:25] RECOVERY - MariaDB Replica Lag: s8 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:40:10] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11161973 (10Clement_Goubert) Silenced for 30 days again [10:40:21] RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:40:21] RECOVERY - MariaDB Replica Lag: s3 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:40:25] RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK 
slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:41:07] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [10:41:41] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [10:42:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82919 and previous config saved to /var/cache/conftool/dbconfig/20250909-104211-fceratto.json [10:43:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1172 for clone', diff saved to https://phabricator.wikimedia.org/P82920 and previous config saved to /var/cache/conftool/dbconfig/20250909-104321-ladsgroup.json [10:43:23] RECOVERY - MariaDB Replica Lag: s1 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2010.codfw.wmnet [10:44:13] (03Merged) 10jenkins-bot: Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1186458 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [10:45:47] (03Merged) 10jenkins-bot: Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook [extensions/CheckUser] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186457 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [10:45:49] (03Merged) 10jenkins-bot: Add the CheckUserSuggestedInvestigationsGetSignals hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1186460 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [10:46:12] 
(03Merged) 10jenkins-bot: Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1186459 (https://phabricator.wikimedia.org/T403959) (owner: 10Dreamy Jazz) [10:47:05] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1172.eqiad.wmnet onto db1193.eqiad.wmnet [10:47:05] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1172.eqiad.wmnet onto db1193.eqiad.wmnet [10:47:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [10:47:37] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#11162004 (10ayounsi) 05Open→03Resolved a:03ayounsi All the tooling, metrics and examples are there for the service owners to setup their alert... [10:48:00] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1172.eqiad.wmnet onto db1193.eqiad.wmnet [10:48:13] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [10:48:28] (03CR) 10Clément Goubert: "Cleaning up." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [10:48:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:48:55] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm [10:49:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:49:22] RECOVERY - MariaDB Replica Lag: s4 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 58.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:49:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:50:31] 06SRE, 06Infrastructure-Foundations, 10netops: Enable BFD on 'core' EBGP peerings from L3 switches to CRs - https://phabricator.wikimedia.org/T374452#11162024 (10cmooney) 05Open→03Declined Not gonna implement this one for now, we can revisit if needed. 
[10:51:25] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1186460|Add the CheckUserSuggestedInvestigationsGetSignals hook (T403111)]], [[gerrit:1186459|Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook (T403959)]], [[gerrit:1186457|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook (T403111)]], [[gerrit:1186458|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch [10:51:25] hook (T403111)]] [10:51:30] T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111 [10:51:31] T403959: Suggested investigations: Define hook to add more users to a case when creating it - https://phabricator.wikimedia.org/T403959 [10:52:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11162039 (10VRiley-WMF) [10:53:03] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [10:53:17] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors: - sre... 
[10:53:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:53:41] (03PS2) 10Elukey: sre.hsots.provision: add sys-121c-tn2r-configg as special case [cookbooks] - 10https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) [10:53:47] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm [10:54:23] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:54:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [10:54:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:55:48] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11162046 (10VRiley-WMF) Reached out to the company to inquire about boxing and sending the older unit back [10:56:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11162048 (10VRiley-WMF) es1056 seems to be having issues imaging. Looking into this. 
[10:56:22] (03PS3) 10Elukey: sre.hsots.provision: add sys-121c-tn2r-configg as special case [cookbooks] - 10https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) [10:56:46] k8s deployment is running slower than usual [10:56:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11162050 (10VRiley-WMF) 05Open→03Resolved [10:57:06] I'm not modifying i18n files, so not sure exactly what might be causing it [10:57:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82921 and previous config saved to /var/cache/conftool/dbconfig/20250909-105719-fceratto.json [10:57:32] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [10:57:38] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162055 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors: - sre... [10:57:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:58:08] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm [10:58:09] i.e. 
deployment to the k8s testservers has not finished but started at 10:53 UTC [10:58:30] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1186460|Add the CheckUserSuggestedInvestigationsGetSignals hook (T403111)]], [[gerrit:1186459|Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook (T403959)]], [[gerrit:1186457|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook (T403111)]], [[gerrit:1186458|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook (T403111 [10:58:30] )]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:58:35] T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111 [10:58:35] T403959: Suggested investigations: Define hook to add more users to a case when creating it - https://phabricator.wikimedia.org/T403959 [10:59:15] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11162063 (10elukey) @Jhancock.wm Hi! I created a quick patch for the host in https://gerrit.wikimedia.org/r/1186454, I used test-cookbook and the host provisioned correctly... [10:59:29] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [11:00:27] I'm not seeing anything wrong per se [11:01:25] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. 
[11:02:20] Image pull took a little longer on a couple hosts (around 2 minutes) [11:02:29] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1008.eqiad.wmnet [11:03:03] ah I know [11:03:08] it's the php 8.3 images [11:03:26] because there are no layers for them on most hosts, since they are only deployed for mw-debug namespaces [11:03:48] Compare [11:03:57] 9m52s Normal Pulled pod/mw-debug.eqiad.pinkunicorn-6899db78db-nxbd2 Successfully pulled image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-09-09-105200-publish-81" in 2.392115319s [11:04:04] 7m31s Normal Pulled pod/mw-debug.eqiad.next-68dbb99bd-wbxwr Successfully pulled image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-09-09-105200-publish-83" in 2m23.831923781s [11:04:23] Ah, I see [11:04:51] (03CR) 10Muehlenhoff: [C:03+2] maps: Remove unused Hiera option [puppet] - 10https://gerrit.wikimedia.org/r/1185051 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:06:33] mvernon@cumin2002 reimage (PID 1391211) is awaiting input [11:06:52] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186460|Add the CheckUserSuggestedInvestigationsGetSignals hook (T403111)]], [[gerrit:1186459|Define the CheckUserSuggestedInvestigationsBeforeCaseCreated hook (T403959)]], [[gerrit:1186457|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch hook (T403111)]], [[gerrit:1186458|Follow-up: Add the CheckUserSuggestedInvestigationsSignalMatch [11:06:53] hook (T403111)]] (duration: 15m 27s) [11:06:58] T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111 [11:06:58] T403959: Suggested investigations: Define hook to add more users to a case when creating it - https://phabricator.wikimedia.org/T403959 [11:07:06] I'm done with my deploys [11:07:31] !log elukey@cumin1003 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1008.eqiad.wmnet [11:07:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [11:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:09:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:10:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:12:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T401906)', diff saved to https://phabricator.wikimedia.org/P82922 and previous config saved to /var/cache/conftool/dbconfig/20250909-111226-fceratto.json [11:12:31] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:12:41] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11162091 (10elukey) ml-serve1008 done: ` START - Cookbook sre.hosts.provision for host ml-serve1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_REST... 
[11:12:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:12:43] (03CR) 10Elukey: "tested in https://phabricator.wikimedia.org/T401964#11162091" [cookbooks] - 10https://gerrit.wikimedia.org/r/1185978 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [11:12:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T401906)', diff saved to https://phabricator.wikimedia.org/P82923 and previous config saved to /var/cache/conftool/dbconfig/20250909-111250-fceratto.json [11:14:40] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:14:41] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:14:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:15:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T401906)', diff saved to https://phabricator.wikimedia.org/P82924 and previous config saved to /var/cache/conftool/dbconfig/20250909-111459-fceratto.json [11:15:32] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1008.eqiad.wmnet [11:15:33] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1008.eqiad.wmnet [11:15:49] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11162101 (10elukey) [11:19:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:25:15] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 
to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1186477 (https://phabricator.wikimedia.org/T404067) [11:27:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T404067 [11:27:09] T404067: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T404067 [11:27:19] (03PS9) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1185939 (https://phabricator.wikimedia.org/T402611) [11:27:57] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186478 [11:28:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2192 from API/vslow/dump T404067', diff saved to https://phabricator.wikimedia.org/P82925 and previous config saved to /var/cache/conftool/dbconfig/20250909-112828-fceratto.json [11:30:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82926 and previous config saved to /var/cache/conftool/dbconfig/20250909-113006-fceratto.json [11:32:39] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. 
[11:33:11] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [11:33:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11162170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [11:34:16] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1186477 (https://phabricator.wikimedia.org/T404067) (owner: 10Gerrit maintenance bot) [11:35:03] (03CR) 10Btullis: [C:03+1] airflow-dev: fix the task logs location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186425 (owner: 10Brouberol) [11:35:37] !log update bookworm d-i image T403852 [11:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:41] T403852: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852 [11:35:57] !log Starting s5 codfw failover from db2213 to db2192 - T404067 [11:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:01] T404067: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T404067 [11:36:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti3008.esams.wmnet with OS bookworm [11:37:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2192 to s5 primary T404067', diff saved to https://phabricator.wikimedia.org/P82927 and previous config saved to /var/cache/conftool/dbconfig/20250909-113740-fceratto.json [11:39:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. 
[11:39:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:41:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:43:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove ganeti02 VIP in esams - jmm@cumin2002" [11:43:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove ganeti02 VIP in esams - jmm@cumin2002" [11:43:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:45:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82928 and previous config saved to /var/cache/conftool/dbconfig/20250909-114514-fceratto.json [11:46:03] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [11:46:11] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162235 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors: - sre... 
[11:47:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [11:47:07] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm [11:47:14] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [11:47:36] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186478 (owner: 10Arnaudb) [11:48:46] (03PS1) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186484 [11:49:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:15] 07sre-alert-triage, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11162274 (10BTullis) 05Open→03Resolved a:03BTullis There are no active `PybalBackendDown` alerts at present, so this c... 
[11:52:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set dbctl values for db2213 T404067', diff saved to https://phabricator.wikimedia.org/P82929 and previous config saved to /var/cache/conftool/dbconfig/20250909-115245-fceratto.json [11:52:51] T404067: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T404067 [11:53:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [11:56:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11162318 (10MoritzMuehlenhoff) [11:58:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1200) [12:00:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T401906)', diff saved to https://phabricator.wikimedia.org/P82932 and previous config saved to /var/cache/conftool/dbconfig/20250909-120021-fceratto.json [12:00:26] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:00:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [12:00:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T401906)', diff saved to https://phabricator.wikimedia.org/P82933 and previous config saved to /var/cache/conftool/dbconfig/20250909-120043-fceratto.json [12:02:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T401906)', diff saved to https://phabricator.wikimedia.org/P82934 and previous config saved to /var/cache/conftool/dbconfig/20250909-120254-fceratto.json 
[12:04:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [12:05:07] ladsgroup@cumin1002 clone (PID 1356026) is awaiting input [12:08:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [12:08:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T401906)', diff saved to https://phabricator.wikimedia.org/P82935 and previous config saved to /var/cache/conftool/dbconfig/20250909-120825-fceratto.json [12:08:29] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:08:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#11162346 (10MoritzMuehlenhoff) [12:08:54] (03PS1) 10Stevemunene: dse-k8s: Augment the dse-k8s cluster namespaces. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) [12:09:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#11162348 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff These got reimaged as part of the transition of esams to routed Ganeti [12:09:26] 06SRE, 06Infrastructure-Foundations: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye - https://phabricator.wikimedia.org/T348730#11162354 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All Ganeti servers are now on Bookworm! 
[12:11:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T401906)', diff saved to https://phabricator.wikimedia.org/P82936 and previous config saved to /var/cache/conftool/dbconfig/20250909-121059-fceratto.json [12:12:54] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1172 gradually with 4 steps - Pool db1172.eqiad.wmnet in after cloning [12:13:36] (03PS10) 10Ayounsi: Use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [12:15:27] (03CR) 10Cmelo: [C:03+1] Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [12:18:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82938 and previous config saved to /var/cache/conftool/dbconfig/20250909-121802-fceratto.json [12:19:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:51] (03CR) 10Ayounsi: Use Homer to configure the network (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [12:21:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS bookworm [12:21:58] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm completed: - sretest2010 (*... [12:23:43] (03CR) 10Ayounsi: "Let's start with this. 
Eventually we could add some code to run homer automatically if the server's switch is Nokia, but not needed for no" [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [12:24:46] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11162415 (10MSantos) [12:25:39] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11162427 (10MSantos) [12:26:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P82939 and previous config saved to /var/cache/conftool/dbconfig/20250909-122607-fceratto.json [12:28:00] vriley@cumin1003 reimage (PID 1777211) is awaiting input [12:32:17] FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82941 and previous config saved to /var/cache/conftool/dbconfig/20250909-123309-fceratto.json [12:37:17] RESOLVED: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:23] !log ladsgroup@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 12:00:00 on db2240.codfw.wmnet with reason: Maintenance [12:38:52] !log ladsgroup@cumin1003 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2240.codfw.wmnet with reason: Maintenance [12:39:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T402925)', diff saved to https://phabricator.wikimedia.org/P82943 and previous config saved to /var/cache/conftool/dbconfig/20250909-123859-ladsgroup.json [12:39:03] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:41:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P82944 and previous config saved to /var/cache/conftool/dbconfig/20250909-124114-fceratto.json [12:41:35] (03PS1) 10Brouberol: airflow-dev: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186497 [12:42:29] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11162481 (10Jclark-ctr) a:05bking→03Jclark-ctr [12:43:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [12:43:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T402925)', diff saved to https://phabricator.wikimedia.org/P82945 and previous config saved to /var/cache/conftool/dbconfig/20250909-124309-ladsgroup.json [12:44:06] (03CR) 10Btullis: [C:03+1] airflow-dev: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186497 (owner: 10Brouberol) [12:45:31] (03CR) 10Brouberol: [C:03+2] airflow-dev: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186497 (owner: 10Brouberol) [12:46:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:42] (03CR) 10Cathal Mooney: [C:03+2] cephosd: un-set bird bgp neighbors rather than override for each host [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [12:48:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T401906)', diff saved to https://phabricator.wikimedia.org/P82947 and previous config saved to /var/cache/conftool/dbconfig/20250909-124817-fceratto.json [12:48:22] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:48:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2199.codfw.wmnet with reason: Maintenance [12:48:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [12:48:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T401906)', diff saved to https://phabricator.wikimedia.org/P82948 and previous config saved to /var/cache/conftool/dbconfig/20250909-124847-fceratto.json [12:50:33] (03PS4) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186484 [12:50:34] (03CR) 10Arnaudb: [C:03+2] "will follow up with sanity revert preshot" [puppet] - 10https://gerrit.wikimedia.org/r/1186484 (owner: 10Arnaudb) [12:50:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T401906)', diff saved to https://phabricator.wikimedia.org/P82949 and previous config saved to /var/cache/conftool/dbconfig/20250909-125058-fceratto.json [12:51:08] (03PS1) 10Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186498 [12:51:17] FIRING: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:24] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:32] PROBLEM - BFD status on lsw1-d2-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T402925)', diff saved to https://phabricator.wikimedia.org/P82950 and previous config saved to /var/cache/conftool/dbconfig/20250909-125429-ladsgroup.json [12:54:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:54:41] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11162524 (10Jclark-ctr) [12:55:26] PROBLEM - BFD status on lsw1-a7-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:56:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T401906)', diff saved to https://phabricator.wikimedia.org/P82951 and previous config saved to /var/cache/conftool/dbconfig/20250909-125621-fceratto.json [12:56:26] T401906: 
Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:57:11] !log test hotfix for doh3006 v6 bird [12:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1172 gradually with 4 steps - Pool db1172.eqiad.wmnet in after cloning [12:59:38] 06SRE, 06Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11162569 (10JAbrams) @ssingh, thanks for your reply and for explaining. I understand now about the canonical domain and why we can’t redirect w... [12:59:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11162570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... [13:00:09] Urbanecm and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1300). [13:00:09] No Gerrit patches in the queue for this window AFAICS. 
[13:01:17] RESOLVED: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:36] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2184.codfw.wmnet with reason: mariadb upgrade [13:01:55] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080 (10Johannes_Richter_WMDE) 03NEW [13:02:56] !log upgrading backup1-codfw db2184 mariadb package [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:04:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3008.esams.wmnet with OS bookworm [13:04:47] !incidents [13:04:48] No incidents occurred in the past 24 hours for team SRE [13:05:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [13:05:26] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie [13:06:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P82954 and previous config saved to /var/cache/conftool/dbconfig/20250909-130606-fceratto.json 
[13:07:11] (03CR) 10Volans: [C:03+1] "LGTM, I don't think it affects any existing workflow for now being the flag optional. Before prime-time it will need to test if dcops can " [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [13:07:59] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11162646 (10Tobi_WMDE_SW) I endorse this request by @Johannes_Richter_WMDE. He is part of the Technical Wishes team at WMDE and needs access for the stated reasons. [13:09:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P82956 and previous config saved to /var/cache/conftool/dbconfig/20250909-130937-ladsgroup.json [13:09:39] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [13:09:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete... 
[13:10:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [13:10:13] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie [13:10:24] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:17] FIRING: ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:44] (03PS2) 10Elukey: sre.hosts.provision: expand Supermicro models with no PXE devs in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964) [13:12:44] (03PS2) 10Elukey: sre.host.provision: move WebServer.1#HostHeaderCheck as optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1185978 (https://phabricator.wikimedia.org/T401964) [13:12:44] (03PS4) 10Elukey: sre.hsots.provision: add sys-121c-tn2r-configg as special case [cookbooks] - 10https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) [13:12:54] PROBLEM - BFD status on lsw1-c2-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:13:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:13:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:13:44] (03CR) 10Elukey: "Tested with ml-lab1002, all good." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [13:14:28] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:15:08] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11162671 (10ssingh) >>! In T389333#11159569, @CDobbins wrote: > @ssingh: that's right. I thought about this a bit over the weekend, and I think the easiest approach is going to... [13:15:22] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:15:23] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2184.codfw.wmnet [13:15:23] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2184.codfw.wmnet [13:15:40] PROBLEM - Host ml-lab1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:42] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1181.eqiad.wmnet [13:16:01] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1181 - Upgrading db1181.eqiad.wmnet [13:16:10] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [13:16:16] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete... 
[13:16:24] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:16:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [13:16:33] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie [13:17:09] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1181 - Upgrading db1181.eqiad.wmnet [13:17:17] FIRING: [7x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:52] RECOVERY - Host ml-lab1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:19:43] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:20:12] (03PS1) 10Ayounsi: Routed Ganeti: install prefixes learned via Bird in kernel table [puppet] - 10https://gerrit.wikimedia.org/r/1186501 [13:21:12] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11162690 (10elukey) [13:21:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P82958 and previous config saved to /var/cache/conftool/dbconfig/20250909-132113-fceratto.json [13:21:46] (03CR) 10Cathal Mooney: [C:03+1] Routed Ganeti: install prefixes learned via Bird in kernel table [puppet] - 10https://gerrit.wikimedia.org/r/1186501 (owner: 10Ayounsi) [13:22:17] RESOLVED: [5x] 
ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:22] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:22:39] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11162695 (10elukey) Also completed ml-lab1001 and ml-lab1002, all good. [13:23:10] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1181.eqiad.wmnet [13:24:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P82959 and previous config saved to /var/cache/conftool/dbconfig/20250909-132444-ladsgroup.json [13:26:31] (03PS2) 10Ayounsi: Routed Ganeti: install prefixes learned via Bird in kernel table [puppet] - 10https://gerrit.wikimedia.org/r/1186501 [13:29:02] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11162710 (10MatthewVernon) Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has left me blocked): # this node cannot PXE boot without manu... [13:29:40] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1181* gradually with 4 steps - Work done [13:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11162726 (10elukey) Hi folks! I worked on T401964 and those ML nodes were configured manually IIUC, so provisioning didn't properly support them. After some review it seems that they are a kind... 
[13:30:44] (03PS1) 10Jcrespo: dbbackups: Ingore dbprov1007/dbprov2007 backup warnings until fully setup [puppet] - 10https://gerrit.wikimedia.org/r/1186502 (https://phabricator.wikimedia.org/T403166) [13:30:52] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1186501 (owner: 10Ayounsi) [13:31:04] (03PS2) 10Jcrespo: dbbackups: Ignore dbprov1007/dbprov2007 backup warnings until fully setup [puppet] - 10https://gerrit.wikimedia.org/r/1186502 (https://phabricator.wikimedia.org/T403166) [13:31:46] (03PS3) 10Jcrespo: dbbackups: Ignore dbprov1007/2007 backup warnings until fully setup [puppet] - 10https://gerrit.wikimedia.org/r/1186502 (https://phabricator.wikimedia.org/T403166) [13:32:52] (03PS1) 10Muehlenhoff: maps: Moving the setting to disable paging to the role level [puppet] - 10https://gerrit.wikimedia.org/r/1186503 (https://phabricator.wikimedia.org/T381565) [13:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:01] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: install prefixes learned via Bird in kernel table [puppet] - 10https://gerrit.wikimedia.org/r/1186501 (owner: 10Ayounsi) [13:35:08] !log rolling https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186501 one routed ganeti host at a time [13:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T401906)', diff saved to https://phabricator.wikimedia.org/P82961 and previous config saved to /var/cache/conftool/dbconfig/20250909-133621-fceratto.json [13:36:25] !log jmm@cumin2002 START 
- Cookbook sre.hosts.downtime for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage [13:36:25] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:36:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [13:36:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T401906)', diff saved to https://phabricator.wikimedia.org/P82962 and previous config saved to /var/cache/conftool/dbconfig/20250909-133644-fceratto.json [13:38:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T401906)', diff saved to https://phabricator.wikimedia.org/P82963 and previous config saved to /var/cache/conftool/dbconfig/20250909-133855-fceratto.json [13:39:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage [13:39:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T402925)', diff saved to https://phabricator.wikimedia.org/P82964 and previous config saved to /var/cache/conftool/dbconfig/20250909-133952-ladsgroup.json [13:39:56] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:40:08] (03PS1) 10Jcrespo: bacula: Ignore backup failures from people1005 & people2004 [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) [13:40:34] (03CR) 10Jcrespo: [C:03+2] dbbackups: Ignore dbprov1007/2007 backup warnings until fully setup [puppet] - 10https://gerrit.wikimedia.org/r/1186502 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [13:41:03] (03CR) 10Arnaudb: [C:03+2] Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186498 (owner: 10Arnaudb) [13:42:00] (03PS2) 10Jcrespo: 
bacula: Ignore backup failures from people1005 & people2004 [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) [13:42:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186503 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:42:59] (03CR) 10Jcrespo: "Let me know your thoughts-- or maybe you weren't aware (?)." [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [13:44:36] (03PS1) 10Arnaudb: Revert^4 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186512 [13:45:47] (03CR) 10Volans: [C:03+1] "Another corner case :( LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [13:48:21] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1185978 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [13:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:30] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) (owner: 10Elukey) [13:53:06] (03CR) 10Elukey: [C:03+1] Use Homer to configure the network (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [13:53:14] !log upgrading Envoy on config-master* T402584 [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:18] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [13:53:50] (03CR) 10Andrew Bogott: [C:03+2] Openstack: add wmcs-projectcleanup.py [puppet] - 10https://gerrit.wikimedia.org/r/1182648 
(https://phabricator.wikimedia.org/T397648) (owner: 10Andrew Bogott) [13:53:53] (03CR) 10Andrew Bogott: [C:03+2] Openstack wmfkeystonehooks: don't clean up after project delete [puppet] - 10https://gerrit.wikimedia.org/r/1182649 (https://phabricator.wikimedia.org/T397648) (owner: 10Andrew Bogott) [13:54:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P82966 and previous config saved to /var/cache/conftool/dbconfig/20250909-135402-fceratto.json [13:54:11] (03CR) 10Elukey: [C:03+1] maps: Moving the setting to disable paging to the role level [puppet] - 10https://gerrit.wikimedia.org/r/1186503 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:55:03] (03CR) 10Clément Goubert: [C:03+1] shellbox-syntaxhighlight: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186009 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [13:56:00] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [13:56:16] (03CR) 10Muehlenhoff: [C:03+2] maps: Moving the setting to disable paging to the role level [puppet] - 10https://gerrit.wikimedia.org/r/1186503 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:57:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [13:57:22] (03Abandoned) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) (owner: 10Brouberol) [13:57:45] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [13:57:51] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - 
https://phabricator.wikimedia.org/T394357#11162873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete... [13:58:28] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1193 gradually with 4 steps - Pool db1193.eqiad.wmnet in after cloning [13:58:49] (03CR) 10Bking: [C:03+2] dse-k8s: Introduce opensearch-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:59:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3008.esams.wmnet with OS bookworm [14:00:04] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1400) [14:00:28] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [14:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [14:02:02] (03PS1) 10Muehlenhoff: Add ganeti3008 to the routed Ganeti cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1186516 (https://phabricator.wikimedia.org/T402259) [14:03:07] !log upgrading Envoy on schema* T402584 [14:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:11] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Introduce rest-gateway-ro [puppet] - 10https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:03:11] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [14:04:14] (03PS1) 10Jforrester: Improve performance of preferred labels subquery [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186517 [14:05:05] (03CR) 10Hnowlan: [C:03+1] wmnet: Introduce 
rest-gateway-ro [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:05:12] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Switch rest-gateway to A/P [puppet] - 10https://gerrit.wikimedia.org/r/1183084 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:09:01] FIRING: [4x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P82969 and previous config saved to /var/cache/conftool/dbconfig/20250909-140910-fceratto.json [14:09:33] (03PS2) 10Arnaudb: Revert^4 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186512 [14:09:33] (03CR) 10Arnaudb: "I'm not 100% sure about mtail scraping config, please let me know if you see something odd!" [puppet] - 10https://gerrit.wikimedia.org/r/1186512 (owner: 10Arnaudb) [14:09:33] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11162971 (10Peachey88) [14:09:38] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11162972 (10VRiley-WMF) 05Open→03Resolved Finished cabling this up. There is something that I'm probably unaware of, but for now, I'm closing this. 
[14:10:04] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: expand Supermicro models with no PXE devs in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [14:10:10] (03CR) 10Elukey: [C:03+2] sre.host.provision: move WebServer.1#HostHeaderCheck as optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1185978 (https://phabricator.wikimedia.org/T401964) (owner: 10Elukey) [14:10:16] (03CR) 10Elukey: [C:03+2] sre.hsots.provision: add sys-121c-tn2r-configg as special case [cookbooks] - 10https://gerrit.wikimedia.org/r/1186454 (https://phabricator.wikimedia.org/T399779) (owner: 10Elukey) [14:14:01] RESOLVED: [4x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:11] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1181* gradually with 4 steps - Work done [14:16:54] (03CR) 10Ayounsi: [C:03+1] Add ganeti3008 to the routed Ganeti cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1186516 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [14:17:35] (03PS1) 10Scott French: trafficserver: relax service lookup in mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1186520 (https://phabricator.wikimedia.org/T403655) [14:24:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T401906)', diff saved to https://phabricator.wikimedia.org/P82973 and previous config saved to /var/cache/conftool/dbconfig/20250909-142417-fceratto.json [14:24:22] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:24:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2219.codfw.wmnet with reason: Maintenance [14:24:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T401906)', diff saved to https://phabricator.wikimedia.org/P82974 and previous config saved to /var/cache/conftool/dbconfig/20250909-142441-fceratto.json [14:26:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T401906)', diff saved to https://phabricator.wikimedia.org/P82975 and previous config saved to /var/cache/conftool/dbconfig/20250909-142652-fceratto.json [14:28:06] (03PS1) 10Muehlenhoff: maps: Remove disable_tile_generation_timer (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1186523 [14:28:34] 10ops-eqsin, 06SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#11163076 (10RobH) >>! In T382519#11161831, @ayounsi wrote: > The physical anchor has been replaced by a VM, moving that task to DCops to recycle the failed hardware : https://netbox.wikimedia.org/dcim/devices/... [14:28:36] (03CR) 10CI reject: [V:04-1] maps: Remove disable_tile_generation_timer (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (owner: 10Muehlenhoff) [14:28:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:58] (03PS2) 10Muehlenhoff: maps: Remove disable_tile_generation_timer (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1186523 [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1430) [14:30:41] (03CR) 10Brouberol: [C:03+1] "Diff looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) (owner: 10Stevemunene) [14:33:51] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:34:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (owner: 10Muehlenhoff) [14:34:51] (03CR) 10Clément Goubert: [C:03+1] "LGTM, adding @vgutierrez@wikimedia.org for a sanity check" [puppet] - 10https://gerrit.wikimedia.org/r/1186520 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:42:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P82977 and previous config saved to /var/cache/conftool/dbconfig/20250909-144159-fceratto.json [14:43:51] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:43:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1193 gradually with 4 steps - Pool db1193.eqiad.wmnet in after cloning [14:43:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1172.eqiad.wmnet onto db1193.eqiad.wmnet [14:44:18] (03PS3) 10Muehlenhoff: maps: Remove disable_tile_generation_timer [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) [14:45:55] (03CR) 10Cathal Mooney: [C:03+2] Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:46:40] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti3008 to the routed Ganeti cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1186516 
(https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [14:47:13] (03Merged) 10jenkins-bot: Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:47:23] (03CR) 10Vgutierrez: [C:03+1] trafficserver: relax service lookup in mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1186520 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:47:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:49:49] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [14:57:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P82979 and previous config saved to /var/cache/conftool/dbconfig/20250909-145707-fceratto.json [14:57:35] (03PS4) 10Muehlenhoff: maps: Remove disable_tile_generation_timer [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) [15:00:05] jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1500). 
[15:00:58] (03PS1) 10Muehlenhoff: maps: Move the setting for planet_sync_hours to the common role setting [puppet] - 10https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) [15:01:42] (03PS1) 10Federico Ceratto: upgrade.py: Restart Prometheus exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/1186532 [15:01:42] (03CR) 10Federico Ceratto: "As discussed on IRC / call" [cookbooks] - 10https://gerrit.wikimedia.org/r/1186532 (owner: 10Federico Ceratto) [15:02:39] (03PS2) 10Federico Ceratto: upgrade.py: Restart Prometheus exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/1186532 [15:04:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:05:08] (03CR) 10Elukey: [C:03+1] maps: Remove disable_tile_generation_timer [puppet] - 10https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:05:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [15:05:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams03 and group B [15:05:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams03 and group B [15:06:04] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11163221 (10Kris_Litson_WMDE) Approved [15:06:14] (03CR) 10Elukey: [C:03+1] pyrra: tonecheck: bump revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1186022 (https://phabricator.wikimedia.org/T400071) (owner: 10Herron) [15:07:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams03 and group B [15:07:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host 
ganeti3007.esams.wmnet to cluster esams03 and group B [15:07:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams03 and group B [15:08:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3008.esams.wmnet to cluster esams03 and group B [15:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:10:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11163248 (10MoritzMuehlenhoff) [15:12:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T401906)', diff saved to https://phabricator.wikimedia.org/P82980 and previous config saved to /var/cache/conftool/dbconfig/20250909-151214-fceratto.json [15:12:19] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:12:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2236.codfw.wmnet with reason: Maintenance [15:12:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T401906)', diff saved to https://phabricator.wikimedia.org/P82981 and previous config saved to /var/cache/conftool/dbconfig/20250909-151238-fceratto.json [15:13:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - 
https://phabricator.wikimedia.org/T402259#11163279 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff esams is fully migrated to routed Ganeti! [15:14:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T401906)', diff saved to https://phabricator.wikimedia.org/P82982 and previous config saved to /var/cache/conftool/dbconfig/20250909-151449-fceratto.json [15:15:17] FIRING: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T402925)', diff saved to https://phabricator.wikimedia.org/P82983 and previous config saved to /var/cache/conftool/dbconfig/20250909-152447-ladsgroup.json [15:24:52] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:25:27] (03CR) 10Herron: [C:03+2] pyrra: tonecheck: bump revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1186022 (https://phabricator.wikimedia.org/T400071) (owner: 10Herron) [15:26:03] 10SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729#11163385 (10elukey) Opened https://github.com/pyrra-dev/pyrra/issues/1576 to upstream, to gather their feedback/opinion. 
[15:27:07] (03CR) 10Dzahn: "I am not aware of a reason why they fail unless something is different on trixie vs previous OS version." [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:29:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P82984 and previous config saved to /var/cache/conftool/dbconfig/20250909-152956-fceratto.json [15:31:33] (03CR) 10Dzahn: "I will try to find out what the reason is via check_bacula.py on the director.." [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:33:36] 06SRE, 06Traffic: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11163424 (10fnegri) 05Open→03Resolved a:03Dzahn Nice, thank you! I will optimistically mark as Resolved. [15:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:58] (03CR) 10Dzahn: "bacula-fd is listed as running, I don't have failed units. but I see these in the output of 'systemctl status bacula-fd':" [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:36:35] (03CR) 10Dzahn: "hrmmm.. 
https://www.google.com/search?q="Bad+caps+from+SD%3A+auth+cram-md5"" [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:36:42] (03PS1) 10Ahmon Dancy: Revert "buildkitd: Bump to v0.24.0" [puppet] - 10https://gerrit.wikimedia.org/r/1186540 [15:37:04] (03PS2) 10Ahmon Dancy: Revert "buildkitd: Bump to v0.24.0" [puppet] - 10https://gerrit.wikimedia.org/r/1186540 (https://phabricator.wikimedia.org/T403625) [15:37:08] (03PS9) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [15:37:52] (03CR) 10Ahmon Dancy: [C:03+1] "Dzahn, please deploy at your earliest convenience." [puppet] - 10https://gerrit.wikimedia.org/r/1186540 (https://phabricator.wikimedia.org/T403625) (owner: 10Ahmon Dancy) [15:38:45] (03CR) 10Dzahn: "this German-language thread seems to imply this could be due to the version difference between client and director being too large" [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:39:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P82985 and previous config saved to /var/cache/conftool/dbconfig/20250909-153955-ladsgroup.json [15:41:45] (03CR) 10Dzahn: "we have this combo here: director: 9.6.7-7 file daemon: 15.0.3-3" [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [15:42:15] (03CR) 10Dzahn: [C:03+2] Revert "buildkitd: Bump to v0.24.0" [puppet] - 10https://gerrit.wikimedia.org/r/1186540 (https://phabricator.wikimedia.org/T403625) (owner: 10Ahmon Dancy) [15:44:17] FIRING: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P82986 and previous config saved to /var/cache/conftool/dbconfig/20250909-154503-fceratto.json [15:45:52] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [15:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:17] RESOLVED: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:00] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11163532 (10RLazarus) @elukey Thank you! Looks like an ownership issue, and yes please if you're comfortable deploying those, I'll take you up on it. (We were just talking in serviceops about th... 
[15:55:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P82987 and previous config saved to /var/cache/conftool/dbconfig/20250909-155503-ladsgroup.json [15:56:17] FIRING: ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:05] jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T401906)', diff saved to https://phabricator.wikimedia.org/P82988 and previous config saved to /var/cache/conftool/dbconfig/20250909-160010-fceratto.json [16:00:15] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:00:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2237.codfw.wmnet with reason: Maintenance [16:00:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T401906)', diff saved to https://phabricator.wikimedia.org/P82989 and previous config saved to /var/cache/conftool/dbconfig/20250909-160032-fceratto.json [16:01:17] FIRING: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:01:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T401906)', diff saved to https://phabricator.wikimedia.org/P82990 and previous config saved to /var/cache/conftool/dbconfig/20250909-160243-fceratto.json [16:04:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:17] FIRING: [4x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:30] 
10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11163622 (10cmooney) [16:09:50] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404104 (10phaultfinder) 03NEW [16:10:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T402925)', diff saved to https://phabricator.wikimedia.org/P82991 and previous config saved to /var/cache/conftool/dbconfig/20250909-161010-ladsgroup.json [16:10:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:11:17] FIRING: [5x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: Maintenance [16:11:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T402763)', diff saved to https://phabricator.wikimedia.org/P82992 and previous config saved to /var/cache/conftool/dbconfig/20250909-161142-fceratto.json [16:11:51] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [16:15:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1186543 (https://phabricator.wikimedia.org/T404106) [16:17:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T404106 [16:17:04] T404106: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T404106 [16:17:52] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P82993 and previous config saved to /var/cache/conftool/dbconfig/20250909-161751-fceratto.json [16:18:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T402763)', diff saved to https://phabricator.wikimedia.org/P82994 and previous config saved to /var/cache/conftool/dbconfig/20250909-161844-fceratto.json [16:18:48] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [16:18:52] (03PS1) 10Andrew Bogott: codfw1dev ceph -> version reef [puppet] - 10https://gerrit.wikimedia.org/r/1186544 [16:19:54] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev ceph -> version reef [puppet] - 10https://gerrit.wikimedia.org/r/1186544 (owner: 10Andrew Bogott) [16:22:30] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1186543 (https://phabricator.wikimedia.org/T404106) (owner: 10Gerrit maintenance bot) [16:23:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech exp[ansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163829 (10cmooney) [16:24:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech exp[ansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163842 (10cmooney) [16:24:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11163843 (10cmooney) [16:24:33] !log Starting s2 codfw failover from db2207 to db2204 - T404106 [16:24:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech exp[ansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163851 (10cmooney) a:05Jclark-ctr→03None 
[16:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:36] T404106: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T404106 [16:25:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163855 (10cmooney) [16:25:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T404106', diff saved to https://phabricator.wikimedia.org/P82995 and previous config saved to /var/cache/conftool/dbconfig/20250909-162514-fceratto.json [16:27:41] (03CR) 10Scott French: "Thank you both for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1186520 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [16:27:42] (03CR) 10Scott French: [C:03+2] trafficserver: relax service lookup in mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1186520 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [16:28:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163873 (10cmooney) [16:32:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P82996 and previous config saved to /var/cache/conftool/dbconfig/20250909-163258-fceratto.json [16:33:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P82997 and previous config saved to /var/cache/conftool/dbconfig/20250909-163351-fceratto.json [16:33:56] !log drain transport circuits landing on cr1-codfw ahead of power supply test on router T401937 [16:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:59] T401937: 
codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937 [16:35:45] PROBLEM - Host sretest2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:05] !log drain set BGP to graceful shutdown mode on cr1-codfw to drain traffic ahead of power supply test T401937 [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:39] FIRING: CoreBGPDown: Core BGP session down between cr3-eqsin and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:40:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11163919 (10cmooney) [16:43:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:48:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T401906)', diff saved to https://phabricator.wikimedia.org/P82998 and previous config saved to /var/cache/conftool/dbconfig/20250909-164806-fceratto.json [16:48:11] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:48:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [16:48:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2240.codfw.wmnet with reason: Maintenance [16:48:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T401906)', diff saved to https://phabricator.wikimedia.org/P82999 and previous config saved to /var/cache/conftool/dbconfig/20250909-164836-fceratto.json [16:48:43] (03PS1) 10Reedy: TOTP: Fix logic for displaying TOTPEnableForm [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186548 (https://phabricator.wikimedia.org/T404091) [16:48:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P83000 and previous config saved to /var/cache/conftool/dbconfig/20250909-164858-fceratto.json [16:49:00] jouncebot: nowandnext [16:49:00] For the next 0 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1600) [16:49:00] In 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1700) [16:49:09] (03CR) 10Reedy: [C:03+2] TOTP: Fix logic for displaying TOTPEnableForm [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186548 (https://phabricator.wikimedia.org/T404091) (owner: 10Reedy) [16:50:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T401906)', diff saved to https://phabricator.wikimedia.org/P83001 and previous config saved to /var/cache/conftool/dbconfig/20250909-165047-fceratto.json [16:51:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:30] 10ops-codfw, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115 (10Jhancock.wm) 03NEW [16:54:53] PROBLEM - 
mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:55:22] 10ops-codfw, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11164012 (10cmooney) [16:55:26] 06SRE, 06Infrastructure-Foundations, 10netops: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590#11164013 (10cmooney) [16:55:48] 10ops-codfw, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11164014 (10cmooney) Thanks Jenn that's great. Probably be into next week before I try to use it but that's all I need for now. [16:56:17] FIRING: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:10] (03Merged) 10jenkins-bot: TOTP: Fix logic for displaying TOTPEnableForm [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186548 (https://phabricator.wikimedia.org/T404091) (owner: 10Reedy) [17:00:05] swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1700). [17:00:22] o/ [17:00:39] Reedy: are you in the process of backporting that patch? [17:01:11] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:01:30] swfrench-wmf: Not really, I can happily wait [17:01:35] Just setting it up for the train [17:03:07] Reedy: ah, got it. so, I don't have any mediawiki deployments planned, but I do have some changes to a service it depends on, which would be preferable not to overlap with a deployment (just to minimize conflicting noise). 
[17:03:20] Go ahead :) [17:03:29] sounds good, thanks :) [17:03:51] (03PS1) 10DDesouza: Deploy Newcomers survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186550 (https://phabricator.wikimedia.org/T402915) [17:03:58] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:04:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T402763)', diff saved to https://phabricator.wikimedia.org/P83002 and previous config saved to /var/cache/conftool/dbconfig/20250909-170406-fceratto.json [17:04:11] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186009 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:04:12] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:04:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: Maintenance [17:04:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186550 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) 
[17:04:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T402763)', diff saved to https://phabricator.wikimedia.org/P83003 and previous config saved to /var/cache/conftool/dbconfig/20250909-170429-fceratto.json [17:05:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P83004 and previous config saved to /var/cache/conftool/dbconfig/20250909-170554-fceratto.json [17:06:24] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186009 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:06:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: Maintenance [17:07:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83005 and previous config saved to /var/cache/conftool/dbconfig/20250909-170701-fceratto.json [17:07:11] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:11:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83006 and previous config saved to /var/cache/conftool/dbconfig/20250909-171140-fceratto.json [17:11:44] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:12:07] !log restored locally modified `helmfile.d/dse-k8s-services/_airflow_common_/values-dev.yaml` duplicating https://gerrit.wikimedia.org/r/1186497 on deploy1003 to unstick deployment-charts updates [17:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 
(T402763)', diff saved to https://phabricator.wikimedia.org/P83007 and previous config saved to /var/cache/conftool/dbconfig/20250909-171224-fceratto.json [17:12:50] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:13:08] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:13:58] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:14:20] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:14:42] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:15:21] !log migrated shellbox-syntaxhighlight to PHP 8.3 in codfw - T403284 [17:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:25] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [17:16:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:18:58] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:19:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.591 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:20:51] (03CR) 10Bking: [C:03+2] opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:21:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P83008 and previous config saved to /var/cache/conftool/dbconfig/20250909-172102-fceratto.json [17:21:17] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:17] RESOLVED: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:02] 10ops-eqsin, 06SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#11164161 (10ayounsi) The anchor doesn't contain any sensitive data, so yep it can be unplugged and recycled anytime. 
[17:27:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P83010 and previous config saved to /var/cache/conftool/dbconfig/20250909-172731-fceratto.json [17:33:47] (03PS3) 10Ebernhardson: cirrus: Reduce galleries weight in search on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) [17:33:58] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11164212 (10cmooney) Papaul did some testing today shuffling things around. **Test 1: Remove PEM 0 from router** We did this, after a few... 
[17:35:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [17:35:17] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:35] (03PS1) 10Bking: opensearch-operator: create namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186557 (https://phabricator.wikimedia.org/T397246) [17:36:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T401906)', diff saved to https://phabricator.wikimedia.org/P83011 and previous config saved to /var/cache/conftool/dbconfig/20250909-173609-fceratto.json [17:36:14] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:37:33] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:38:04] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:38:12] (03CR) 10Ssingh: "I am going to defer to Valentin for the review, since I do have a few questions about the pooled state (specifically, how much we should c" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:38:16] !log migrated shellbox-syntaxhighlight to PHP 8.3 in eqiad - T403284 [17:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:20] T403284: Migrate production Shellbox services to PHP 8.3 
- https://phabricator.wikimedia.org/T403284 [17:38:46] (03CR) 10Ebernhardson: cirrus: Reduce galleries weight in search on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: 10Ebernhardson) [17:39:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:42:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P83012 and previous config saved to /var/cache/conftool/dbconfig/20250909-174239-fceratto.json [17:43:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:44:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:47:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [17:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[17:50:56] 10ops-esams, 06SRE, 06DC-Ops: esams: document power cables in Netbox - https://phabricator.wikimedia.org/T403376#11164310 (10RobH) a:03RobH We documented this onto the google elevation doc at the time of the esams migration, but never copied it over to netbox. https://docs.google.com/spreadsheets/d/17OS4d... [17:51:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:53:12] 10ops-esams, 06SRE, 06DC-Ops: esams: document power cables in Netbox - https://phabricator.wikimedia.org/T403376#11164341 (10RobH) p:05Triage→03Low [17:57:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T402763)', diff saved to https://phabricator.wikimedia.org/P83013 and previous config saved to /var/cache/conftool/dbconfig/20250909-175746-fceratto.json [17:57:51] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:58:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:58:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T402763)', diff saved to https://phabricator.wikimedia.org/P83014 and previous config saved to /var/cache/conftool/dbconfig/20250909-175808-fceratto.json [17:59:24] (03CR) 10Dzahn: [C:03+1] "thanks for creating https://phabricator.wikimedia.org/T404114 which confirms my suspicion" [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [17:59:52] (03CR) 10Dzahn: [C:03+2] bacula: Ignore backup failures from people1005 & people2004 [puppet] - 10https://gerrit.wikimedia.org/r/1186509 (https://phabricator.wikimedia.org/T402596) (owner: 10Jcrespo) [18:00:04] dduvall and dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T1800). [18:00:16] I'm lingering. [18:00:45] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11164378 (10KFrancis) Hi @Johannes_Richter_WMDE I have sent the NDA for you to sign via DocuSign. Thanks! [18:01:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:03:37] dancy: dduvall I didn't yet deploy (or stage) the OATHAuth patch, but it is merged, so if you want to put that out before moving the train, that would be appreciated! [18:06:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T402763)', diff saved to https://phabricator.wikimedia.org/P83015 and previous config saved to /var/cache/conftool/dbconfig/20250909-180605-fceratto.json [18:06:09] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:09:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:14:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: Maintenance [18:14:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T402763)', diff saved to https://phabricator.wikimedia.org/P83016 and previous config saved to /var/cache/conftool/dbconfig/20250909-181448-fceratto.json [18:14:53] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:21:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2163', diff saved to https://phabricator.wikimedia.org/P83017 and previous config saved to /var/cache/conftool/dbconfig/20250909-182112-fceratto.json [18:21:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T402763)', diff saved to https://phabricator.wikimedia.org/P83018 and previous config saved to /var/cache/conftool/dbconfig/20250909-182132-fceratto.json [18:21:36] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:23:46] (03PS10) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [18:24:50] (03CR) 10Dzahn: [C:03+2] "tested queries from phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/1185257 (https://phabricator.wikimedia.org/T403887) (owner: 10Aklapper) [18:24:53] (03CR) 10Ryan Kemper: wdqs: (step 2) remove wdqs discovery dns records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [18:25:57] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186561 (https://phabricator.wikimedia.org/T396379) [18:25:59] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186561 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:27:25] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186561 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:28:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:31:04] 
!log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [18:32:00] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [18:32:05] Reedy: sorry, i missed your message before starting the train. is it ok to backport afterwards? [18:32:34] dduvall: Yeah, all good! It doesn't result in logspam, just a broken (from a UI pov) page for enabling 2FA [18:32:48] alrighty [18:36:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P83019 and previous config saved to /var/cache/conftool/dbconfig/20250909-183619-fceratto.json [18:36:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P83020 and previous config saved to /var/cache/conftool/dbconfig/20250909-183639-fceratto.json [18:37:07] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186557 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [18:39:00] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.18 refs T396379 [18:39:04] T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379 [18:40:54] !log dduvall@deploy1003 Started scap sync-world: Backport for [[gerrit:1186548|TOTP: Fix logic for displaying TOTPEnableForm (T404091 T230042)]] [18:41:00] T404091: Attempting to enable TOTP gives a page with no controls - https://phabricator.wikimedia.org/T404091 [18:41:00] T230042: Allow multiple TOTP devices - https://phabricator.wikimedia.org/T230042 [18:44:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:45:35] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:46:12] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:46:16] (03PS1) 10Jgreen: nsca_frack.cfg.erb create hostgroup fundraising-minio adding check-minio [puppet] - 10https://gerrit.wikimedia.org/r/1186566 (https://phabricator.wikimedia.org/T386259) [18:46:41] !log dduvall@deploy1003 dduvall, reedy: Backport for [[gerrit:1186548|TOTP: Fix logic for displaying TOTPEnableForm (T404091 T230042)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:46:47] T404091: Attempting to enable TOTP gives a page with no controls - https://phabricator.wikimedia.org/T404091 [18:46:47] T230042: Allow multiple TOTP devices - https://phabricator.wikimedia.org/T230042 [18:48:52] (03PS1) 10Dzahn: zuul(new): update path to ssh private key for zuul executor [puppet] - 10https://gerrit.wikimedia.org/r/1186567 (https://phabricator.wikimedia.org/T403847) [18:49:45] Reedy: let me know when i can continue with the backport [18:51:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T402763)', diff saved to https://phabricator.wikimedia.org/P83021 and previous config saved to /var/cache/conftool/dbconfig/20250909-185128-fceratto.json [18:51:33] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:51:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: Maintenance [18:51:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P83022 and previous config saved to /var/cache/conftool/dbconfig/20250909-185146-fceratto.json [18:51:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T402763)', 
diff saved to https://phabricator.wikimedia.org/P83023 and previous config saved to /var/cache/conftool/dbconfig/20250909-185158-fceratto.json [18:53:24] (03CR) 10Dzahn: [V:03+1 C:03+2] "/var/ssh/nodepool: cannot open `/var/ssh/nodepool' (No such file or directory)" [puppet] - 10https://gerrit.wikimedia.org/r/1186567 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:53:43] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11164653 (10Johannes_Richter_WMDE) >>! In T404080#11164378, @KFrancis wrote: > Hi @Johannes_Richter_WMDE I have sent the NDA for you to sign via DocuSign. Thanks! Thanks, signed. [18:54:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.022 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:57:46] (03CR) 10Dwisehaupt: [C:03+1] nsca_frack.cfg.erb create hostgroup fundraising-minio adding check-minio [puppet] - 10https://gerrit.wikimedia.org/r/1186566 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [18:59:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T402763)', diff saved to https://phabricator.wikimedia.org/P83024 and previous config saved to /var/cache/conftool/dbconfig/20250909-185950-fceratto.json [18:59:56] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:03:58] !log dduvall@deploy1003 dduvall, reedy: Continuing with sync [19:06:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.388 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:06:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 
(T402763)', diff saved to https://phabricator.wikimedia.org/P83025 and previous config saved to /var/cache/conftool/dbconfig/20250909-190654-fceratto.json [19:06:59] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:07:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2146.codfw.wmnet with reason: Maintenance [19:07:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T402763)', diff saved to https://phabricator.wikimedia.org/P83026 and previous config saved to /var/cache/conftool/dbconfig/20250909-190716-fceratto.json [19:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:09:13] !log dduvall@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186548|TOTP: Fix logic for displaying TOTPEnableForm (T404091 T230042)]] (duration: 28m 18s) [19:09:18] T404091: Attempting to enable TOTP gives a page with no controls - https://phabricator.wikimedia.org/T404091 [19:09:19] T230042: Allow multiple TOTP devices - https://phabricator.wikimedia.org/T230042 [19:12:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [19:14:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402763)', diff saved to https://phabricator.wikimedia.org/P83029 and previous config saved to /var/cache/conftool/dbconfig/20250909-191410-fceratto.json [19:14:15] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:14:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P83030 and previous config saved to 
/var/cache/conftool/dbconfig/20250909-191457-fceratto.json [19:15:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11164850 (10CDobbins) 05Open→03In progress a:03CDobbins [19:16:30] (03CR) 10CDobbins: [C:03+2] admin: upgrade musikanimal from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1186024 (https://phabricator.wikimedia.org/T403868) (owner: 10Dzahn) [19:19:40] musikanimal: here's your train conductor pin [19:25:16] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [19:25:46] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [19:29:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P83031 and previous config saved to /var/cache/conftool/dbconfig/20250909-192917-fceratto.json [19:30:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P83032 and previous config saved to /var/cache/conftool/dbconfig/20250909-193005-fceratto.json [19:32:46] (03CR) 10Bking: Replace elasticsearch api with python requests (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [19:34:13] (03PS1) 10Clare Ming: xLab: Deploy v1.0.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186584 (https://phabricator.wikimedia.org/T387173) [19:39:11] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186584 (https://phabricator.wikimedia.org/T387173) (owner: 10Clare Ming) [19:40:47] (03Merged) 10jenkins-bot: xLab: Deploy v1.0.2 release to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1186584 (https://phabricator.wikimedia.org/T387173) (owner: 10Clare Ming) [19:42:02] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [19:44:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P83033 and previous config saved to /var/cache/conftool/dbconfig/20250909-194425-fceratto.json [19:45:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T402763)', diff saved to https://phabricator.wikimedia.org/P83034 and previous config saved to /var/cache/conftool/dbconfig/20250909-194512-fceratto.json [19:45:17] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:45:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: Maintenance [19:45:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T402763)', diff saved to https://phabricator.wikimedia.org/P83035 and previous config saved to /var/cache/conftool/dbconfig/20250909-194534-fceratto.json [19:46:44] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [19:50:38] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:51:59] musikanimal blow the horn! blow the horn! 
[19:52:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T402763)', diff saved to https://phabricator.wikimedia.org/P83036 and previous config saved to /var/cache/conftool/dbconfig/20250909-195210-fceratto.json [19:52:15] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:52:27] hey sorry in a meeting! bbl [19:53:11] gets deployment privs and is already in 24/7 meetings... it all goes by so fast... [19:56:05] ;) [19:58:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:59:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402763)', diff saved to https://phabricator.wikimedia.org/P83037 and previous config saved to /var/cache/conftool/dbconfig/20250909-195932-fceratto.json [19:59:37] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:59:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:59:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T402763)', diff saved to https://phabricator.wikimedia.org/P83038 and previous config saved to /var/cache/conftool/dbconfig/20250909-195955-fceratto.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T2000). [20:00:05] danisztls and Daimona: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] o/ [20:00:12] o/ [20:00:13] I can self-deploy [20:01:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186550 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:02:49] (03Merged) 10jenkins-bot: Deploy Newcomers survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186550 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:02:49] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:03:16] !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1186550|Deploy Newcomers survey on enwiki (T402915)]] [20:03:17] (03PS1) 10Gergő Tisza: Add $wgJwtPrivateKey / $wgJwtPublicKey in the fake privatre repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186592 (https://phabricator.wikimedia.org/T399631) [20:03:19] (03PS1) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) [20:03:20] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:04:04] (03CR) 10Gergő Tisza: [C:04-2] "Should probably reduce the cookie expiry first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:04:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186592 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:05:21] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [20:05:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11165016 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [20:05:59] added one more patch [20:06:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402763)', diff saved to https://phabricator.wikimedia.org/P83039 and previous config saved to /var/cache/conftool/dbconfig/20250909-200641-fceratto.json [20:06:46] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:07:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:07:28] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:08:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [20:09:26] !log dani@deploy1003 dani: Backport for [[gerrit:1186550|Deploy Newcomers survey on enwiki (T402915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:30] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:10:29] !log dani@deploy1003 dani: Continuing with sync [20:13:53] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:14:48] lmk if anyone needs a deployer [20:15:41] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186550|Deploy Newcomers survey on enwiki (T402915)]] (duration: 12m 25s) [20:15:45] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:16:31] all done [20:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:38] Daimona: all yours [20:16:40] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:16:48] Thanks! 
I'll need a deployer though [20:17:45] Daimona: i can deploy for you - 1 sec [20:17:54] Thanks ^_^ [20:18:20] (03PS2) 10Daimona Eaytoy: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) [20:19:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:21:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:21:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [20:21:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P83040 and previous config saved to /var/cache/conftool/dbconfig/20250909-202148-fceratto.json [20:22:22] (03Merged) 10jenkins-bot: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [20:22:46] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1180890|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW (T397476)]] [20:22:50] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [20:25:33] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Server provision script updates for Nokia switch support - https://phabricator.wikimedia.org/T404146 (10cmooney) 03NEW p:05Triage→03Medium [20:25:36] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Server provision script updates for 
Nokia switch support - https://phabricator.wikimedia.org/T404146#11165106 (10cmooney) [20:27:47] okay, mutante perryprog: Were you trying to get me in on this deployment window or what? Sorry I had the meeting! I'd be interested to do that but I thought for my first deploy I'd do so with Tim or Harumi (who are on my team) watching over me on video [20:27:51] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11165109 (10cmooney) [20:28:14] Oh no no, I was just congratulating you, lol [20:28:23] hehe okay! well thank you :) [20:28:50] !log cjming@deploy1003 cjming, daimona: Backport for [[gerrit:1180890|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW (T397476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:54] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [20:29:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.816 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:30:02] Daimona: on test servers if testable [20:30:06] Yup, testing [20:30:55] musikanimal: no, just wanted to tell you you can now [20:31:21] :) [20:31:38] musikanimal: the one thing you can do is confirm you can ssh to deployment servers and run the spiderpig ini command [20:31:54] sure, will do now [20:32:03] scap spiderpig-otp [20:32:28] that will allow you to use the web UI later [20:33:07] cjming: looks good AFAICT [20:33:13] great! syncing [20:33:17] !log cjming@deploy1003 cjming, daimona: Continuing with sync [20:33:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:33:26] Thank you! [20:33:32] yw! [20:33:58] tgr: will you self-deploy? 
[20:35:13] mutante: so I run `scap spiderpig-otp` on deploy1003 for example? [20:35:28] cjming: yes, thanks [20:35:31] oh there it is, I see it in the docs now [20:35:32] musikanimal: yes [20:36:00] and after that you can open https://spiderpig.wikimedia.org in a browser [20:36:06] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:36:40] so I can't login to SpiderPig yet it seems: "Service access denied due to missing privileges." [20:36:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P83041 and previous config saved to /var/cache/conftool/dbconfig/20250909-203656-fceratto.json [20:36:59] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [20:37:02] using my Wikimedia developer account credentials [20:37:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11165144 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... [20:37:16] but I can get into dpeloy10003! [20:37:21] *deploy1003 [20:37:37] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165143 (10DonTrung) Has anyone actually used Wikimedia websites on a mobile device here? Most of the t... [20:37:45] musikanimal: debugging.. 
might be LDAP group [20:37:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: Maintenance [20:37:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T402763)', diff saved to https://phabricator.wikimedia.org/P83042 and previous config saved to /var/cache/conftool/dbconfig/20250909-203754-fceratto.json [20:37:55] ok, thanks [20:37:59] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:38:31] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180890|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW (T397476)]] (duration: 15m 45s) [20:38:35] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [20:38:40] tgr: all yours [20:38:49] thx [20:38:49] Daimona: should be live! [20:39:02] Nice, thank you [20:39:14] musikanimal: it's not what I thought. you do have the LDAP group spiderpig-access and the deployment shell group.. hmm.. maybe we should say what's happening on your access request ticket [20:39:36] sure, I can comment there [20:39:45] thanks, sounds good [20:40:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11165149 (10Dzahn) I confirmed musikanimal has the deployment shell group on deploy1003 and the LDAP group `spiderpig-access`, yet something seems to be missing. [20:41:04] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11165152 (10MusikAnimal) Yes, when I try to login to SpiderPig with my Wikimedia developer account, I get `Service access denied due to missing privileges.` I can SSH into deploy1003 though... [20:41:26] musikanimal: did you ran that scap command though? and it seemed to work? 
[20:41:29] run [20:41:32] yep [20:41:33] musikanimal: did you try logging out and back in on idp.wikimedia.org? [20:41:38] ok [20:41:39] no let me try that [20:41:45] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165154 (10Novem_Linguae) I don't think this ticket is to remove mobile mode (i.e. the skin MinervaNeue... [20:42:39] that worked! ty taavi [20:42:43] aha, cool [20:43:08] alright I'm in! I used the OTP and now I get a UI [20:43:26] ok, then I guess we can call it resolved. and you can actually try this whenever, with or without video [20:43:41] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165163 (10TheDJ) >>! In T214998#11165143, @DonTrung wrote: > Has anyone actually used Wikimedia websit... [20:43:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11165165 (10MusikAnimal) Disregard! I needed to logout and log back in on idp.wikimedia.org. Looks like I'm all set now :) [20:44:34] yeah it seems somewhat foolproofed for inexperienced deployers like me? I really only ever plan to deploy config changes, and tiny fixes etc. [20:44:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:44:54] and if there's demand, I will help out here during deployment windows. 
Eventually :) [20:45:15] (03PS1) 10Ahmon Dancy: fix-staging-perms.sh: chmod 0664 patch files [puppet] - 10https://gerrit.wikimedia.org/r/1186600 (https://phabricator.wikimedia.org/T404145) [20:45:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T402763)', diff saved to https://phabricator.wikimedia.org/P83043 and previous config saved to /var/cache/conftool/dbconfig/20250909-204550-fceratto.json [20:45:55] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:46:54] deploying the private change together with the public change [20:47:03] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165193 (10Rexogamer) do I understand correctly that you want to use the desktop version on mobile? if... [20:47:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186592 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:47:28] musikanimal: yea, maybe you want to do the very first one with others and a typo fix or something. yea, the point of spiderpig is to make it less intimidating [20:47:55] 👍 [20:48:03] (03Merged) 10jenkins-bot: Add $wgJwtPrivateKey / $wgJwtPublicKey in the fake privatre repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186592 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:48:27] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1186592|Add $wgJwtPrivateKey / $wgJwtPublicKey in the fake privatre repo (T399631)]] [20:48:29] I like the sound of that. I was deployer years ago, but gave it up because it was, well, too intimidating! 
I don't think I ever broke anything but I sure felt like I came close [20:48:31] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:48:49] mutante: Semi-urgent: Can you deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186600 for me please? [20:49:41] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:49:43] What's intimidating about making live code changes to one of the biggest websites in the world where not only millions of people will notice if you mess up, but the record of you messing up and how exactly you messed up is publicly logged? [20:49:55] (<3) [20:50:01] dancy: ok [20:50:05] TY! [20:50:23] (03CR) 10Dzahn: [C:03+2] fix-staging-perms.sh: chmod 0664 patch files [puppet] - 10https://gerrit.wikimedia.org/r/1186600 (https://phabricator.wikimedia.org/T404145) (owner: 10Ahmon Dancy) [20:51:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:52:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402763)', diff saved to https://phabricator.wikimedia.org/P83044 and previous config saved to /var/cache/conftool/dbconfig/20250909-205203-fceratto.json [20:52:08] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:52:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:52:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T402763)', diff saved to https://phabricator.wikimedia.org/P83045 and previous config saved to /var/cache/conftool/dbconfig/20250909-205226-fceratto.json [20:52:27] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet 
with chassis set policy FORCE_RESTART [20:54:02] dancy: i manually ran the chmod for the one affected file [20:54:20] There are several files [20:54:26] `find /srv/patches -name "*.patch" -not -perm 0664 -print0 | xargs -0 -r chmod 0664` [20:54:34] !log tgr@deploy1003 tgr: Backport for [[gerrit:1186592|Add $wgJwtPrivateKey / $wgJwtPublicKey in the fake privatre repo (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:54:38] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:54:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.515 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:12] dancy: fixed [20:55:22] Thanks! [20:57:25] !log deploy1003/deploy2002 - find /srv/patches/ -name "*.patch" -not -perm 0664 -print0 | xargs -0 -r sudo chmod 0664 [20:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:51] !log tgr@deploy1003 tgr: Continuing with sync [20:59:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T402763)', diff saved to https://phabricator.wikimedia.org/P83046 and previous config saved to /var/cache/conftool/dbconfig/20250909-205903-fceratto.json [20:59:07] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:59:17] 06SRE, 06Release-Engineering-Team, 06serviceops: docker-registry will show different last updated time as you refresh the page... - https://phabricator.wikimedia.org/T404011#11165251 (10bd808) `lang=shell-session bd808@deploy1003:~$ curl -s https://docker-registry.discovery.wmnet | grep 'Last updated at'... 
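Note on the permission fix logged at 20:57:25: the `find ... -not -perm 0664 -print0 | xargs -0 -r chmod 0664` pipeline only touches `*.patch` entries whose mode is not already exactly 0664, and `-r` keeps `chmod` from running when nothing matches. The following is a minimal Python sketch of the same idea, not the tooling used on the deploy hosts. The function name `normalize_patch_modes` and the `dry_run` flag are hypothetical, and unlike the `find` command the sketch skips directories and symlinks.

```python
import stat
import tempfile
from pathlib import Path


def normalize_patch_modes(root, mode=0o664, dry_run=False):
    """Return *.patch files under root whose permission bits differ from mode.

    Hypothetical helper for illustration only. With dry_run=False the files
    are also chmod-ed to `mode`; with dry_run=True they are only reported.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.patch")):
        # Restrict to regular files; the find pipeline has no such filter.
        if path.is_symlink() or not path.is_file():
            continue
        current = stat.S_IMODE(path.stat().st_mode)
        if current != mode:
            if not dry_run:
                path.chmod(mode)
            changed.append(path)
    return changed


# Demo on a throwaway directory (deliberately not /srv/patches).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    wrong = root / "sub" / "a.patch"   # wrong mode, should be fixed
    right = root / "b.patch"           # already 0664, should be left alone
    other = root / "notes.txt"         # not a patch file, should be ignored
    for f in (wrong, right, other):
        f.write_text("demo\n")
    wrong.chmod(0o600)
    right.chmod(0o664)
    other.chmod(0o600)

    preview = [p.name for p in normalize_patch_modes(root, dry_run=True)]
    fixed = [p.name for p in normalize_patch_modes(root)]
    remaining = normalize_patch_modes(root, dry_run=True)
    final_modes = {
        f.name: oct(stat.S_IMODE(f.stat().st_mode)) for f in (wrong, right, other)
    }
```

Running the dry-run pass first mirrors listing the affected files before changing them, which would have shown up front that several files were affected rather than one.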
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250909T2100) [21:00:30] (03Abandoned) 10Jdlrobson: Drop deprecated survey prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: 10Awight) [21:00:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P83047 and previous config saved to /var/cache/conftool/dbconfig/20250909-210058-fceratto.json [21:01:18] (03CR) 10Jdlrobson: [C:04-1] "Needs rebase. Should be blocked on https://phabricator.wikimedia.org/T393436." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: 10Awight) [21:01:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.685 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:04:43] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186592|Add $wgJwtPrivateKey / $wgJwtPublicKey in the fake privatre repo (T399631)]] (duration: 16m 16s) [21:04:48] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [21:06:13] !log UTC late deploys done [21:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:39] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [21:08:02] (03PS1) 10Dzahn: docker: add support for trixie, ensure docker-cli is installed [puppet] - 
10https://gerrit.wikimedia.org/r/1186609 [21:08:17] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [21:09:04] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:11:56] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:13:43] (03CR) 10Dzahn: "dpkg -L docker.io | grep /usr/bin/docker" [puppet] - 10https://gerrit.wikimedia.org/r/1186609 (owner: 10Dzahn) [21:14:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P83048 and previous config saved to /var/cache/conftool/dbconfig/20250909-211410-fceratto.json [21:14:57] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165349 (10Strainu) >>! In T214998#11165163, @TheDJ wrote: > This should make that easier and not harde... 
[21:16:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P83049 and previous config saved to /var/cache/conftool/dbconfig/20250909-211605-fceratto.json [21:16:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:54] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:24:41] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:29:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P83050 and previous config saved to /var/cache/conftool/dbconfig/20250909-212918-fceratto.json [21:29:35] 06SRE, 06Release-Engineering-Team, 06serviceops: docker-registry will show different last updated time as you refresh the page... - https://phabricator.wikimedia.org/T404011#11165359 (10bd808) Those content pages are generated by [[https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/modu... [21:31:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T402763)', diff saved to https://phabricator.wikimedia.org/P83051 and previous config saved to /var/cache/conftool/dbconfig/20250909-213112-fceratto.json [21:31:16] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [21:31:20] 06SRE, 06Release-Engineering-Team, 06serviceops: docker-registry will show different last updated time as you refresh the page... 
- https://phabricator.wikimedia.org/T404011#11165361 (10bd808) 05Open→03Invalid Closing as "works as designed". I guess reopen if the timer skew feels problematic and not... [21:31:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:31:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T402763)', diff saved to https://phabricator.wikimedia.org/P83052 and previous config saved to /var/cache/conftool/dbconfig/20250909-213134-fceratto.json [21:32:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11165367 (10Dzahn) 05In progress→03Resolved [21:33:47] (03PS1) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [21:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:34:59] (03CR) 10CI reject: [V:04-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [21:36:02] (03PS2) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [21:39:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T402763)', diff saved to https://phabricator.wikimedia.org/P83053 and previous config saved to 
/var/cache/conftool/dbconfig/20250909-213928-fceratto.json [21:39:33] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [21:41:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184937 (https://phabricator.wikimedia.org/T401227) (owner: 10BryanDavis) [21:42:34] (03Merged) 10jenkins-bot: beta: Remove replica instance from wmgMainStashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184937 (https://phabricator.wikimedia.org/T401227) (owner: 10BryanDavis) [21:43:02] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1184937|beta: Remove replica instance from wmgMainStashServers (T401227)]] [21:43:06] T401227: DBQueryError "The MariaDB server is running with the --read-only option" fails MainStash in Beta Cluster - https://phabricator.wikimedia.org/T401227 [21:43:44] (03CR) 10Aklapper: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1185257 (https://phabricator.wikimedia.org/T403887) (owner: 10Aklapper) [21:43:59] oh, thanks Krinkle. 
I forgot to merge that I guess :/ [21:44:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T402763)', diff saved to https://phabricator.wikimedia.org/P83054 and previous config saved to /var/cache/conftool/dbconfig/20250909-214425-fceratto.json [21:44:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: Maintenance [21:44:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T402763)', diff saved to https://phabricator.wikimedia.org/P83055 and previous config saved to /var/cache/conftool/dbconfig/20250909-214449-fceratto.json [21:44:53] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [21:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:48:59] !log krinkle@deploy1003 krinkle, bd808: Backport for [[gerrit:1184937|beta: Remove replica instance from wmgMainStashServers (T401227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:49:04] T401227: DBQueryError "The MariaDB server is running with the --read-only option" fails MainStash in Beta Cluster - https://phabricator.wikimedia.org/T401227 [21:49:38] !log krinkle@deploy1003 krinkle, bd808: Continuing with sync [21:49:56] bd808: np, didn't mean to ping you. I guess it figured out the happing on its own? 
[21:52:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T402763)', diff saved to https://phabricator.wikimedia.org/P83056 and previous config saved to /var/cache/conftool/dbconfig/20250909-215249-fceratto.json [21:52:54] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [21:54:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P83057 and previous config saved to /var/cache/conftool/dbconfig/20250909-215436-fceratto.json [21:55:06] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184937|beta: Remove replica instance from wmgMainStashServers (T401227)]] (duration: 12m 03s) [21:55:10] T401227: DBQueryError "The MariaDB server is running with the --read-only option" fails MainStash in Beta Cluster - https://phabricator.wikimedia.org/T401227 [21:58:58] (03PS1) 10Clare Ming: xLab: Deploy v1.0.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186616 (https://phabricator.wikimedia.org/T371225) [21:59:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:05] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:04:35] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[22:07:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P83058 and previous config saved to /var/cache/conftool/dbconfig/20250909-220757-fceratto.json [22:09:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P83059 and previous config saved to /var/cache/conftool/dbconfig/20250909-220944-fceratto.json [22:12:53] (03PS1) 10JHathaway: WIP: boot loop [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 [22:14:27] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11165529 (10Ladsgroup) Okay let me try: this change is not making the mobile mode go away. It doesn't ch... [22:14:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.345 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:16:12] (03PS2) 10JHathaway: WIP: boot loop [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 [22:17:14] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:29] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:23:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P83060 and previous config saved to /var/cache/conftool/dbconfig/20250909-222305-fceratto.json [22:23:44] (03PS3) 10JHathaway: provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 [22:24:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T402763)', diff 
saved to https://phabricator.wikimedia.org/P83061 and previous config saved to /var/cache/conftool/dbconfig/20250909-222451-fceratto.json [22:24:56] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [22:25:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: Maintenance [22:25:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T402763)', diff saved to https://phabricator.wikimedia.org/P83062 and previous config saved to /var/cache/conftool/dbconfig/20250909-222514-fceratto.json [22:28:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:32:17] FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:33:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T402763)', diff saved to https://phabricator.wikimedia.org/P83063 and previous config saved to /var/cache/conftool/dbconfig/20250909-223308-fceratto.json [22:33:13] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [22:37:17] FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:38:13] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2173 (T402763)', diff saved to https://phabricator.wikimedia.org/P83064 and previous config saved to /var/cache/conftool/dbconfig/20250909-223812-fceratto.json [22:38:19] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [22:38:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: Maintenance [22:38:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T402763)', diff saved to https://phabricator.wikimedia.org/P83065 and previous config saved to /var/cache/conftool/dbconfig/20250909-223835-fceratto.json [22:42:17] FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T402763)', diff saved to https://phabricator.wikimedia.org/P83066 and previous config saved to /var/cache/conftool/dbconfig/20250909-224527-fceratto.json [22:45:32] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [22:47:17] FIRING: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P83067 and previous config saved to /var/cache/conftool/dbconfig/20250909-224816-fceratto.json [22:48:46] (03PS1) 10Dzahn: zuul::executor: include profile::pki::client [puppet] - 10https://gerrit.wikimedia.org/r/1186630 
(https://phabricator.wikimedia.org/T403847) [22:49:08] (03CR) 10Dzahn: [C:03+2] zuul::executor: include profile::pki::client [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [22:49:29] (03PS2) 10Dzahn: zuul::executor: include profile::pki::client [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) [22:51:37] (03PS3) 10Dzahn: zuul::executor: add TLS certs for zookeeper config [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) [22:52:17] RESOLVED: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:17] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:54:27] (03CR) 10Dzahn: [C:03+2] zuul::executor: add TLS certs for zookeeper config [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [22:54:33] (03PS4) 10Dzahn: zuul::executor: add TLS certs for zookeeper config [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) [22:58:17] RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:47] FIRING: ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:31] (03CR) 10Dzahn: [C:03+2] zuul::executor: add TLS certs for zookeeper config [puppet] - 10https://gerrit.wikimedia.org/r/1186630 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [23:00:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P83068 and previous config saved to /var/cache/conftool/dbconfig/20250909-230034-fceratto.json [23:03:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P83069 and previous config saved to /var/cache/conftool/dbconfig/20250909-230323-fceratto.json [23:03:47] FIRING: [3x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:47] FIRING: [4x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:13:47] FIRING: [3x] ProbeDown: Service wdqs2010:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:15:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P83070 and previous config saved to /var/cache/conftool/dbconfig/20250909-231542-fceratto.json [23:18:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T402763)', diff saved to https://phabricator.wikimedia.org/P83071 and previous config saved to /var/cache/conftool/dbconfig/20250909-231831-fceratto.json [23:18:35] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [23:18:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: Maintenance [23:18:47] RESOLVED: [2x] ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:18:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T402763)', diff saved to https://phabricator.wikimedia.org/P83072 and previous config saved to /var/cache/conftool/dbconfig/20250909-231854-fceratto.json [23:26:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T402763)', diff saved to https://phabricator.wikimedia.org/P83073 and previous config saved to /var/cache/conftool/dbconfig/20250909-232608-fceratto.json [23:26:13] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [23:30:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2174 (T402763)', diff saved to https://phabricator.wikimedia.org/P83074 and previous config saved to /var/cache/conftool/dbconfig/20250909-233049-fceratto.json [23:30:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: Maintenance [23:31:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T402763)', diff saved to https://phabricator.wikimedia.org/P83075 and previous config saved to /var/cache/conftool/dbconfig/20250909-233101-fceratto.json [23:37:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T402763)', diff saved to https://phabricator.wikimedia.org/P83076 and previous config saved to /var/cache/conftool/dbconfig/20250909-233752-fceratto.json [23:37:57] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [23:38:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186639 [23:38:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186639 (owner: 10TrainBranchBot) [23:38:16] (03PS3) 10RLazarus: all charts: Update mesh.configuration 1.13.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) [23:38:16] (03PS1) 10RLazarus: all charts: Update mesh.configuration 1.14.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186640 (https://phabricator.wikimedia.org/T403101) [23:39:17] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[23:41:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P83077 and previous config saved to /var/cache/conftool/dbconfig/20250909-234116-fceratto.json [23:44:17] RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:48:37] (03CR) 10CI reject: [V:04-1] all charts: Update mesh.configuration 1.13.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [23:52:17] FIRING: ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:53:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P83078 and previous config saved to /var/cache/conftool/dbconfig/20250909-235300-fceratto.json [23:53:35] (03CR) 10Scott French: [C:03+1] "Does what it says on the tin!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186640 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [23:55:16] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186639 (owner: 10TrainBranchBot) [23:56:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P83079 and previous config saved to /var/cache/conftool/dbconfig/20250909-235624-fceratto.json [23:57:17] RESOLVED: ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:58:07] (03CR) 10RLazarus: [C:03+2] all charts: Update mesh.configuration 1.14.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186640 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [23:59:17] FIRING: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown