[00:00:34] <wikibugs>	 (Merged) jenkins-bot: all charts: Update mesh.configuration 1.14.0 to 1.14.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1186640 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:04:17] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P83080 and previous config saved to /var/cache/conftool/dbconfig/20250910-000807-fceratto.json
[00:08:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645
[00:08:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645 (owner: TrainBranchBot)
[00:11:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:11:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T402763)', diff saved to https://phabricator.wikimedia.org/P83081 and previous config saved to /var/cache/conftool/dbconfig/20250910-001131-fceratto.json
[00:11:36] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[00:11:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: Maintenance
[00:13:34] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[00:14:11] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[00:15:14] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[00:16:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:16:26] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[00:16:56] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[00:18:02] <wikibugs>	 (CR) RLazarus: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:18:05] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[00:21:17] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:22:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:23:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T402763)', diff saved to https://phabricator.wikimedia.org/P83082 and previous config saved to /var/cache/conftool/dbconfig/20250910-002315-fceratto.json
[00:23:19] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[00:23:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: Maintenance
[00:23:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83083 and previous config saved to /var/cache/conftool/dbconfig/20250910-002338-fceratto.json
[00:27:17] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:32] <wikibugs>	 (PS1) Andrew Bogott: wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647
[00:29:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83084 and previous config saved to /var/cache/conftool/dbconfig/20250910-002943-fceratto.json
[00:29:48] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[00:31:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645 (owner: TrainBranchBot)
[00:31:06] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott)
[00:35:18] <wikibugs>	 (PS2) Andrew Bogott: wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647
[00:39:02] <wikibugs>	 (PS1) Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649
[00:39:16] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott)
[00:39:19] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: Andrew Bogott)
[00:39:38] <wikibugs>	 (CR) CI reject: [V:-1] Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: Andrew Bogott)
[00:41:05] <wikibugs>	 (CR) RLazarus: "The gargantuan CI diff is correct, after some semiautomated review: it's all chart patch-version bumps, associated checksums, and in some " [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:42:50] <wikibugs>	 (PS2) Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649
[00:43:19] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: Andrew Bogott)
[00:44:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P83085 and previous config saved to /var/cache/conftool/dbconfig/20250910-004451-fceratto.json
[00:45:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:45:57] <wikibugs>	 (PS1) Sbisson: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886)
[00:46:58] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: Sbisson)
[00:48:08] <wikibugs>	 (PS1) Sbisson: Desktop publish_success: add revid and pageid [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975)
[00:48:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: Sbisson)
[00:50:17] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:51:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:59:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P83086 and previous config saved to /var/cache/conftool/dbconfig/20250910-005958-fceratto.json
[01:00:49] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:04:06] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:06:17] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:49] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott)
[01:11:17] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:12:58] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 08s)
[01:15:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83087 and previous config saved to /var/cache/conftool/dbconfig/20250910-011506-fceratto.json
[01:15:11] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[01:15:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: Maintenance
[01:16:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:19:35] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/ 3.975% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:19:59] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2203.codfw.wmnet with reason: Maintenance
[01:20:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83088 and previous config saved to /var/cache/conftool/dbconfig/20250910-012006-fceratto.json
[01:25:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83089 and previous config saved to /var/cache/conftool/dbconfig/20250910-012533-fceratto.json
[01:25:38] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[01:26:17] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:36:17] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:36:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:17] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:45:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:11:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: Maintenance
[02:11:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83090 and previous config saved to /var/cache/conftool/dbconfig/20250910-021116-fceratto.json
[02:11:21] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[02:17:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83091 and previous config saved to /var/cache/conftool/dbconfig/20250910-021720-fceratto.json
[02:17:25] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[02:26:17] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:31:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:32:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P83092 and previous config saved to /var/cache/conftool/dbconfig/20250910-023228-fceratto.json
[02:36:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:39:23] <wikibugs>	 (PS1) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T402584)
[02:42:26] <wikibugs>	 (PS2) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663)
[02:46:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:47:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P83093 and previous config saved to /var/cache/conftool/dbconfig/20250910-024735-fceratto.json
[02:56:17] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:02:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83094 and previous config saved to /var/cache/conftool/dbconfig/20250910-030243-fceratto.json
[03:02:48] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[03:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:56:20] <wikibugs>	 (PS1) Papaul: Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845)
[04:26:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:31:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:31:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:36:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:04:06] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:08:58] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:50] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/ 3.39% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:33:58] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:40:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:46:49] <moritzm>	 !log rebalance ganeti03 in esams T402259
[05:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:53] <stashbot>	 T402259: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259
[05:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:51:30] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM, comment inline" [puppet] - https://gerrit.wikimedia.org/r/1186609 (owner: Dzahn)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T0600)
[06:04:54] <moritzm>	 !log installing node-minipass security updates
[06:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:16:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:40:46] <icinga-wm>	 PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:51] <jinxer-wm>	 FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:44:32] <wikibugs>	 (Restored) Thiemo Kreuz (WMDE): Drop deprecated survey prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: Awight)
[06:45:48] <icinga-wm>	 RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 35.64 ms
[06:50:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:50:50] <wikibugs>	 (CR) Elukey: [C:+1] maps: Move the setting for planet_sync_hours to the common role setting [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[06:50:51] <jinxer-wm>	 FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:06:06] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: sync
[07:06:21] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: sync
[07:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:11:27] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: sync
[07:11:41] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync
[07:11:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: sync
[07:12:06] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: sync
[07:12:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: sync
[07:13:01] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: sync
[07:13:24] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/image-suggestion: sync
[07:13:35] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync
[07:13:55] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: sync
[07:14:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: sync
[07:14:32] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: sync
[07:14:47] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: sync
[07:19:40] <wikibugs>	 SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11166132 (elukey) Ack! Upgraded staging, and pinged the DSE SREs as well on slack to gather their opinion about ownership etc..
[07:30:06] <wikibugs>	 (CR) Ayounsi: [C:+1] Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[07:31:21] <wikibugs>	 (PS2) Bartosz Wójtowicz: statistics: Update model upload script to check for correct boto3 version. [puppet] - https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301)
[07:35:18] <wikibugs>	 (CR) Elukey: [C:+2] "nice work!" [puppet] - https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[07:41:24] <wikibugs>	 (CR) Muehlenhoff: [C:+2] maps: Remove disable_tile_generation_timer [puppet] - https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:41:29] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet
[07:41:30] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet
[07:41:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1009.eqiad.wmnet
[07:44:04] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[07:44:19] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply
[07:44:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply
[07:45:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply
[07:46:49] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1009.eqiad.wmnet
[07:47:12] <wikibugs>	 (PS2) Muehlenhoff: maps: Move the setting for planet_sync_hours to the common role setting [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565)
[07:48:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[07:49:10] <moritzm>	 !log upgrading Envoy on chartmuseum* T402584
[07:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[07:50:18] <brouberol>	 !log upgraded envoy on dse-k8s-eqiad/dataset-config(-next) - T402584
[07:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:04] <icinga-wm>	 PROBLEM - Host ml-serve1009 is DOWN: PING CRITICAL - Packet loss = 100%
[07:51:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:53:32] <icinga-wm>	 RECOVERY - Host ml-serve1009 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[07:53:37] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[07:54:12] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet
[07:54:13] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet
[07:54:27] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1010.eqiad.wmnet
[07:55:53] <wikibugs>	 (PS1) Muehlenhoff: Update Ganeti alias for esams [puppet] - https://gerrit.wikimedia.org/r/1186925 (https://phabricator.wikimedia.org/T402259)
[07:56:27] <wikibugs>	 (CR) Ayounsi: "overall lgtm, some comments" [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[07:59:43] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1010.eqiad.wmnet
[08:00:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Update Ganeti alias for esams [puppet] - https://gerrit.wikimedia.org/r/1186925 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[08:00:41] <wikibugs>	 (CR) Muehlenhoff: [C:+2] maps: Move the setting for planet_sync_hours to the common role setting [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:00:42] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:02:42] <icinga-wm>	 PROBLEM - Host ml-serve1010 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:12] <icinga-wm>	 RECOVERY - Host ml-serve1010 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[08:05:56] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:06:43] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1010.eqiad.wmnet
[08:06:43] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1010.eqiad.wmnet
[08:06:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1011.eqiad.wmnet
[08:12:04] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1011.eqiad.wmnet
[08:12:50] <wikibugs>	 (PS1) JMeybohm: Code changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1186929
[08:13:48] <wikibugs>	 (PS1) Brouberol: runner: redact mysql password from the command string when reporting an error [dumps] - https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162)
[08:14:16] <wikibugs>	 (CR) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:14:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:14:35] <wikibugs>	 (PS2) JMeybohm: Fix rename success handling, BackendHandler additions [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1186929
[08:14:54] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Fix rename success handling, BackendHandler additions [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1186929 (owner: JMeybohm)
[08:15:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11166246 (elukey)
[08:15:21] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002"
[08:15:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002
[08:16:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002
[08:16:15] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002"
[08:16:54] <icinga-wm>	 PROBLEM - Host ml-serve1011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:17:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11166249 (elukey) Open→Resolved All hosts done, and the provision cookbook now supports them. The supermicros for ML have a special firmware (Legacy/UEFI) that don...
[08:18:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11166252 (elukey)
[08:19:12] <wikibugs>	 (PS2) Brouberol: runner: redact mysql password from the command string when reporting an error [dumps] - https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162)
[08:19:22] <icinga-wm>	 RECOVERY - Host ml-serve1011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[08:19:27] <wikibugs>	 (CR) Ayounsi: EBGP Config: Move all ASN definitions to 'asns_mapping' (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:19:41] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:19:52] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1011.eqiad.wmnet
[08:19:53] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1011.eqiad.wmnet
[08:41:23] <wikibugs>	 (PS3) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[08:42:03] <wikibugs>	 (CR) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:42:47] <wikibugs>	 (CR) CI reject: [V:-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:44:32] <wikibugs>	 (PS1) Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565)
[08:44:32] <wikibugs>	 (PS4) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[08:45:00] <wikibugs>	 (CR) CI reject: [V:-1] Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:45:58] <wikibugs>	 (CR) CI reject: [V:-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:46:42] <wikibugs>	 (PS5) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[08:48:32] <wikibugs>	 (CR) CI reject: [V:-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:51:15] <wikibugs>	 (PS6) Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[08:57:14] <wikibugs>	 (PS2) Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565)
[09:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:04] <wikibugs>	 (PS15) Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[09:03:57] <wikibugs>	 (PS1) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[09:04:06] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:04:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:06:13] <wikibugs>	 (PS2) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[09:07:03] <wikibugs>	 (CR) Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[09:09:36] <wikibugs>	 (CR) CI reject: [V:-1] P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935 (owner: Slyngshede)
[09:14:13] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: flip debdeploy::client::ensure to present [puppet] - https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845)
[09:14:14] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: exclude nfs/nfs4 from debdeploy::client in cloud [puppet] - https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845)
[09:15:44] <wikibugs>	 (CR) Elukey: Setup maps2011 as master node for new maps/codfw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:19:50] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/ 2.797% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:25:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11166358 (ayounsi) > We need to allow port number 48 on the Nokias, but not port number 0 as they start from 1 We already (and lazily) do :  `min_value=0, max_value=48` which...
[09:27:03] <wikibugs>	 (CR) Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:28:10] <claime>	 !log cgoubert@deploy1003:/home$ sudo lvextend -L +20G /dev/vg0/root && sudo resize2fs /dev/vg0/root - T404060
[09:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:14] <stashbot>	 T404060: makeMailingList.php creates 30GB of data - https://phabricator.wikimedia.org/T404060
[09:29:35] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space deploy1003:9100:/ 2.773% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:29:52] <wikibugs>	 (PS3) Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565)
[09:30:15] <wikibugs>	 (CR) Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:30:18] <godog>	 claime: you may be interested in lvextend --resizefs --size <etc> FWIW
[09:30:43] <claime>	 godog: Ah, didn't know you could add a switch to do both at the same time, thanks!
[09:30:57] <claime>	 I haven't looked at lvextend's man page in something like a decade lol
[09:31:28] <godog>	 sure np! yeah I think it has been added "recently"
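The exchange above contrasts two ways of growing a logical volume together with its filesystem: the two-step `lvextend` plus `resize2fs` that was logged, and the single `lvextend --resizefs` call that godog suggests. A minimal sketch of both forms, using the size and device from the logged command. The helper names `two_step_cmd` and `one_step_cmd` are hypothetical, and the functions only print the commands rather than run them:

```shell
#!/bin/sh
# Hypothetical dry-run helpers: they print the commands instead of executing
# them, so nothing is resized. Size and device come from the logged command.

# Two-step form, as logged: extend the LV, then grow the ext filesystem.
two_step_cmd() {
    printf 'sudo lvextend -L %s %s && sudo resize2fs %s\n' "$1" "$2" "$2"
}

# One-step form: --resizefs (short: -r) asks lvextend to resize the
# filesystem along with the LV; --size is the long form of -L.
one_step_cmd() {
    printf 'sudo lvextend --resizefs --size %s %s\n' "$1" "$2"
}

two_step_cmd +20G /dev/vg0/root
one_step_cmd +20G /dev/vg0/root
```

Running either printed command for real requires root and enough free extents in the volume group.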
[09:32:08] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: Filippo Giunchedi)
[09:32:23] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:33:08] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Do we really still use NFS3? But LGTM" [puppet] - https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: Filippo Giunchedi)
[09:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:36:03] <wikibugs>	 (CR) Elukey: Setup maps2011 as master node for new maps/codfw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:36:06] <wikibugs>	 (CR) Elukey: [C:+1] Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:37:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector
[09:37:24] <wikibugs>	 (PS1) Elukey: role::maps: fix tegola_swift_container for codfw [puppet] - https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565)
[09:37:33] <wikibugs>	 (CR) David Caro: [C:+1] "I think we tried this once, not sure why we did not get on with it though. Thanks for pushing for this again though." [puppet] - https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: Filippo Giunchedi)
[09:37:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[09:38:22] <wikibugs>	 (CR) Elukey: [C:+2] role::maps: fix tegola_swift_container for codfw [puppet] - https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[09:38:26] <moritzm>	 !log upgrading Envoy on Logstash T402584
[09:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:30] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[09:40:21] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: Filippo Giunchedi)
[09:40:49] <wikibugs>	 (CR) Elukey: [C:+1] provision: on reboot wait for bios attrs [cookbooks] - https://gerrit.wikimedia.org/r/1186619 (owner: JHathaway)
[09:41:43] <wikibugs>	 (PS3) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[09:41:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:42:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2011.codfw.wmnet
[09:44:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector
[09:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:58] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:46:32] <wikibugs>	 (PS4) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[09:47:42] <moritzm>	 !log upgrading Envoy on contint T402584
[09:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:45] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[09:48:01] <wikibugs>	 SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171 (elukey) NEW
[09:48:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2011.codfw.wmnet
[09:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:49:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2012.codfw.wmnet
[09:50:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Setup maps2011 as master node for new maps/codfw servers [puppet] - https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:51:47] <wikibugs>	 SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11166493 (elukey)
[09:53:12] <wikibugs>	 (PS5) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[09:55:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2012.codfw.wmnet
[09:56:41] <moritzm>	 !log upgrading Envoy on lists T402584
[09:56:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:45] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[09:56:53] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[09:57:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83095 and previous config saved to /var/cache/conftool/dbconfig/20250910-095700-ladsgroup.json
[09:57:05] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[09:57:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2013.codfw.wmnet
[09:58:04] <wikibugs>	 (PS6) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1000)
[10:02:33] <wikibugs>	 (CR) Vgutierrez: sre.loadbalancer: modify admin.py to accept 'reboot' action (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[10:03:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2013.codfw.wmnet
[10:03:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2014.codfw.wmnet
[10:05:29] <wikibugs>	 (PS7) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[10:06:30] <moritzm>	 !log upgrading Envoy on lists T402584
[10:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:34] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[10:06:37] <wikibugs>	 SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11166565 (MoritzMuehlenhoff) All baremetal installations of Envoy have been upgraded
[10:06:42] <moritzm>	 !log upgrading Envoy on Phabricator  T402584
[10:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1186946 (https://phabricator.wikimedia.org/T404178)
[10:08:45] <wikibugs>	 (PS8) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[10:09:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2014.codfw.wmnet
[10:09:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1011.eqiad.wmnet
[10:10:57] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404178
[10:11:01] <stashbot>	 T404178: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T404178
[10:11:27] <wikibugs>	 (PS1) Ladsgroup: db1181: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186947 (https://phabricator.wikimedia.org/T399955)
[10:12:31] <moritzm>	 !log imported imposm3 0.14.1-2 T381565
[10:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:36] <stashbot>	 T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565
[10:13:03] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11166591 (MoritzMuehlenhoff) I've updated imposm once more to cherrypick two additional fixes:  https://github.com/omniscale/imposm3/commit/dc3ebd0746ba7a73b2099c2cda343fc2c6d8d206...
[10:13:45] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Glow up db1181 (T399955)', diff saved to https://phabricator.wikimedia.org/P83096 and previous config saved to /var/cache/conftool/dbconfig/20250910-101345-ladsgroup.json
[10:13:50] <stashbot>	 T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955
[10:14:45] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: Glow up
[10:15:28] <wikibugs>	 (CR) Ladsgroup: [C:+2] db1181: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186947 (https://phabricator.wikimedia.org/T399955) (owner: Ladsgroup)
[10:15:56] <wikibugs>	 SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11166605 (elukey) As a first very bare/minimum example I created:  ` version: "prometheus/v1" service: "citoid" labels:   owner: "sre" slos:   - name: "requests-availability"     objective: 99.5     descri...
[10:16:32] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:17:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1011.eqiad.wmnet
[10:17:59] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83097 and previous config saved to /var/cache/conftool/dbconfig/20250910-101758-ladsgroup.json
[10:18:02] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[10:20:49] <wikibugs>	 (PS16) Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[10:22:35] <wikibugs>	 (CR) Federico Ceratto: [C:+2] mariadb: Promote db2203 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1186946 (https://phabricator.wikimedia.org/T404178) (owner: Gerrit maintenance bot)
[10:24:09] <federico3>	 !log Starting s1 codfw failover from db2212 to db2203 - T404178
[10:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:13] <stashbot>	 T404178: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T404178
[10:25:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T404178', diff saved to https://phabricator.wikimedia.org/P83098 and previous config saved to /var/cache/conftool/dbconfig/20250910-102507-fceratto.json
[10:27:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1012.eqiad.wmnet
[10:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:31:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:06] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P83099 and previous config saved to /var/cache/conftool/dbconfig/20250910-103305-ladsgroup.json
[10:33:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1012.eqiad.wmnet
[10:34:37] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1181 T399955', diff saved to https://phabricator.wikimedia.org/P83100 and previous config saved to /var/cache/conftool/dbconfig/20250910-103436-ladsgroup.json
[10:34:42] <stashbot>	 T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955
[10:34:50] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.661 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:35:32] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1186948 (https://phabricator.wikimedia.org/T404180)
[10:35:37] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1186949 (https://phabricator.wikimedia.org/T404180)
[10:36:18] <wikibugs>	 (CR) CI reject: [V:-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[10:37:30] <Amir1>	 jouncebot: nowandnext
[10:37:30] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1000)
[10:37:30] <jouncebot>	 In 0 hour(s) and 22 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100)
[10:40:49] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T404180
[10:40:53] <stashbot>	 T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180
[10:41:16] <wikibugs>	 (PS17) Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[10:41:28] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:23] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1181 with weight 0 T404180', diff saved to https://phabricator.wikimedia.org/P83101 and previous config saved to /var/cache/conftool/dbconfig/20250910-104223-ladsgroup.json
[10:48:14] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P83103 and previous config saved to /var/cache/conftool/dbconfig/20250910-104813-ladsgroup.json
[10:48:53] <wikibugs>	 (CR) Ladsgroup: [C:+2] mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1186948 (https://phabricator.wikimedia.org/T404180) (owner: Gerrit maintenance bot)
[10:49:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1013.eqiad.wmnet
[10:50:21] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet
[10:50:28] <Amir1>	 !log Starting s7 eqiad failover from db1236 to db1181 - T404180
[10:50:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:32] <stashbot>	 T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180
[10:50:42] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1173 - Upgrading db1173.eqiad.wmnet
[10:50:43] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T404180', diff saved to https://phabricator.wikimedia.org/P83104 and previous config saved to /var/cache/conftool/dbconfig/20250910-105042-ladsgroup.json
[10:51:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:51:31] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1173 - Upgrading db1173.eqiad.wmnet
[10:52:05] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1181 to s7 primary and set section read-write T404180', diff saved to https://phabricator.wikimedia.org/P83106 and previous config saved to /var/cache/conftool/dbconfig/20250910-105205-ladsgroup.json
[10:54:03] <wikibugs>	 (CR) Ladsgroup: [C:+2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1186949 (https://phabricator.wikimedia.org/T404180) (owner: Gerrit maintenance bot)
[10:54:19] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[10:55:26] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[10:55:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1013.eqiad.wmnet
[10:56:51] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1236 T404180', diff saved to https://phabricator.wikimedia.org/P83107 and previous config saved to /var/cache/conftool/dbconfig/20250910-105650-ladsgroup.json
[10:56:55] <stashbot>	 T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180
[10:57:50] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet
[10:58:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1014.eqiad.wmnet
[11:00:05] <jouncebot>	 mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100). nyaa~
[11:00:57] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11166780 (MoritzMuehlenhoff)
[11:03:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83109 and previous config saved to /var/cache/conftool/dbconfig/20250910-110320-ladsgroup.json
[11:03:25] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[11:03:36] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[11:03:44] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83110 and previous config saved to /var/cache/conftool/dbconfig/20250910-110343-ladsgroup.json
[11:04:05] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet
[11:04:25] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1173 - Upgrading db1173.eqiad.wmnet
[11:04:33] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1173 - Upgrading db1173.eqiad.wmnet
[11:04:54] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1186953
[11:05:18] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale-full only: 3 (gerrit2003, ...), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:05:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1014.eqiad.wmnet
[11:05:53] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: Clean up the mess
[11:07:39] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1236.eqiad.wmnet
[11:07:47] <moritzm>	 !log kick off full OSM import for the new maps cluster in codfw T381565
[11:07:48] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1236 - Upgrading db1236.eqiad.wmnet
[11:07:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:51] <stashbot>	 T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565
[11:07:55] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1236 - Upgrading db1236.eqiad.wmnet
[11:08:26] <wikibugs>	 (CR) Ladsgroup: [C:+1] upgrade.py: Restart Prometheus exporter [cookbooks] - https://gerrit.wikimedia.org/r/1186532 (owner: Federico Ceratto)
[11:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:09:30] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2212.codfw.wmnet with reason: Maintenance
[11:09:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83111 and previous config saved to /var/cache/conftool/dbconfig/20250910-110937-fceratto.json
[11:09:42] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[11:10:43] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1173 gradually with 4 steps - Upgrade of db1173.eqiad.wmnet completed
[11:13:34] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1236.eqiad.wmnet
[11:15:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83113 and previous config saved to /var/cache/conftool/dbconfig/20250910-111503-fceratto.json
[11:15:08] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[11:19:37] <wikibugs>	 (PS9) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[11:20:49] <wikibugs>	 (PS3) Tchanders: Enable temporary accounts on all medium-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: STran)
[11:22:38] <wikibugs>	 (PS1) Ladsgroup: db1236: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186957 (https://phabricator.wikimedia.org/T399955)
[11:22:40] <wikibugs>	 (CR) Tchanders: [C:+1] Enable temporary accounts on all medium-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: STran)
[11:23:55] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83114 and previous config saved to /var/cache/conftool/dbconfig/20250910-112354-ladsgroup.json
[11:23:59] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[11:24:30] <wikibugs>	 (CR) Ladsgroup: [C:+2] db1236: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186957 (https://phabricator.wikimedia.org/T399955) (owner: Ladsgroup)
[11:25:15] <moritzm>	 !log installing Linux 6.1.148 on Bookworm hosts
[11:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:31] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 #page on db1236 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:31] <icinga-wm>	 PROBLEM - MariaDB read only s7 on db1236 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:27:32] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 #page on db1236 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:49] <jynus>	 ^ federico3 expected?
[11:27:53] <icinga-wm>	 PROBLEM - mysqld processes #page on db1236 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:27:54] <federico3>	 no, looking
[11:28:09] <federico3>	 Amir1?
[11:28:13] <Amir1>	 ignore the alert
[11:28:22] <jynus>	 ok
[11:28:24] <Amir1>	 expired/removed downtime
[11:28:33] <wikibugs>	 (PS1) Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958
[11:28:33] <Amir1>	 !incidents
[11:28:34] <sirenbot>	 6723 (UNACKED)  db1236 (paged)/MariaDB Replica SQL: s7 (paged)
[11:28:34] <sirenbot>	 6724 (UNACKED)  db1236 (paged)/MariaDB Replica IO: s7 (paged)
[11:28:34] <sirenbot>	 6725 (UNACKED)  db1236 (paged)/mysqld processes (paged)
[11:28:40] <Amir1>	 !ack 6723
[11:28:41] <sirenbot>	 6723 (ACKED)  db1236 (paged)/MariaDB Replica SQL: s7 (paged)
[11:28:44] <Amir1>	 !ack 6724
[11:28:45] <sirenbot>	 6724 (ACKED)  db1236 (paged)/MariaDB Replica IO: s7 (paged)
[11:28:51] <Amir1>	 !ack 6725
[11:28:51] <sirenbot>	 6725 (ACKED)  db1236 (paged)/mysqld processes (paged)
[11:29:06] <sobanski>	 Related to https://phabricator.wikimedia.org/T399955?
[11:29:35] <Amir1>	 yup but I did downtime the whole thing for two hours
[11:29:39] <wikibugs>	 (PS10) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[11:29:53] <icinga-wm>	 RECOVERY - mysqld processes #page on db1236 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:29:59] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958 (owner: Ayounsi)
[11:30:31] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s7 #page on db1236 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:30:33] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 #page on db1236 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:30:33] <icinga-wm>	 RECOVERY - MariaDB read only s7 on db1236 is OK: Version 10.11.13-MariaDB-log, Uptime 76s, read_only: True, event_scheduler: True, 178.48 QPS, connection latency: 0.026488s, query latency: 0.000869s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:30:38] <Amir1>	 we should really just write something so it wouldn't page for depooled hosts
[11:30:41] <wikibugs>	 (PS11) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[11:31:15] <jynus>	 The whole point is that it alerted before making a mistake
[11:31:33] <jynus>	 otherwise it is too late- it requires human intervention before, not after
[11:31:35] <claime>	 Amir1: Can you access the pooled state from the host itself?
[11:32:08] <Amir1>	 claime: I don't think so but it should be in https://noc.wikimedia.org/dbconfig/eqiad.json or equivalent in codfw 
[11:32:18] <Amir1>	 so matter of http request
[11:32:18] <jynus>	 although it could be downgraded to a non-p*ging one when depooled
[11:32:27] <wikibugs>	 (PS12) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[11:32:30] <Amir1>	 yeah
[11:32:31] <claime>	 Amir1: Hmm, my thought was to dump the pooled state in the node-exporter config
[11:32:46] <claime>	 so you can check it from inside the am alert
[11:33:04] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet
[11:33:09] <claime>	 s/config/file/
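[editor's note] The idea discussed above (look up a host's pooled state with an HTTP request and downgrade the alert to non-paging when the host is depooled) could be sketched roughly as follows. This is a minimal illustration, not an existing check: the function names are hypothetical, the "sectionLoads" layout (one list of {host: weight} maps per section) is an assumption about the JSON served at the noc.wikimedia.org URL Amir1 mentions, and the replica name db9999 is made up for the sample data.

```python
import json
from urllib.request import urlopen

# URL taken from the conversation; the document layout used below is ASSUMED.
DBCONFIG_URL = "https://noc.wikimedia.org/dbconfig/eqiad.json"


def fetch_dbconfig(url: str = DBCONFIG_URL) -> dict:
    """Download and parse the published dbconfig JSON (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def pooled_hosts(dbconfig: dict) -> set:
    """Collect every host that carries a load entry in any section.

    ASSUMPTION: dbconfig["sectionLoads"] maps a section name to a list of
    {host: weight} dictionaries, and depooled hosts are simply absent.
    """
    hosts = set()
    for load_maps in dbconfig.get("sectionLoads", {}).values():
        for load_map in load_maps:
            hosts.update(load_map)
    return hosts


def alert_severity(host: str, dbconfig: dict) -> str:
    """Return "page" for pooled hosts and "warning" for depooled ones."""
    return "page" if host in pooled_hosts(dbconfig) else "warning"


if __name__ == "__main__":
    # Hand-written sample mirroring the state after the s7 switchover:
    # db1181 is primary, db1236 is depooled, db9999 is a made-up replica.
    sample = {"sectionLoads": {"s7": [{"db1181": 0}, {"db9999": 300}]}}
    for name in ("db1181", "db1236"):
        print(name, alert_severity(name, sample))
```

Keeping the lookup (fetch_dbconfig) separate from the decision (alert_severity) means the decision can be exercised with canned data, and the same logic would still apply if the pooled state were instead dumped to a local file for node-exporter, as claime suggests.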
[11:33:25] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2191 - Upgrading db2191.codfw.wmnet
[11:33:44] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2191 - Upgrading db2191.codfw.wmnet
[11:37:11] <Amir1>	 > 10:40 ladsgroup@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T404180
[11:37:11] <stashbot>	 T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180
[11:37:25] <Amir1>	 The downtime isn't expired, something is removing the downtime
[11:39:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P83117 and previous config saved to /var/cache/conftool/dbconfig/20250910-113902-ladsgroup.json
[11:40:29] <wikibugs>	 (CR) Cathal Mooney: "LGTM overall nice work!  We should probably try to apply it on a device see if there is any issue?" [homer/public] - https://gerrit.wikimedia.org/r/1186958 (owner: Ayounsi)
[11:41:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:43:01] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:43:02] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100)
[11:43:02] <jouncebot>	 In 1 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1300)
[11:43:58] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[11:44:34] <Dreamy_Jazz>	 jouncebot: nowandnex
[11:44:37] <Dreamy_Jazz>	 jouncebot: update
[11:44:48] <Dreamy_Jazz>	 jouncebot: refresh
[11:44:49] <jouncebot>	 I refreshed my knowledge about deployments.
[11:44:54] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:44:54] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100)
[11:44:54] <jouncebot>	 In 0 hour(s) and 15 minute(s): Deployment of CheckUser Suggested Investigations signals (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200)
[11:44:56] <wikibugs>	 (PS1) KartikMistry: Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730)
[11:44:56] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2191 gradually with 4 steps - Upgrade of db2191.codfw.wmnet completed
[11:45:42] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance
[11:45:50] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2212 (T402925)', diff saved to https://phabricator.wikimedia.org/P83120 and previous config saved to /var/cache/conftool/dbconfig/20250910-114549-ladsgroup.json
[11:45:54] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[11:47:32] <wikibugs>	 (PS1) Btullis: Install the opensearch-operator-crd chart to the dse-k8s clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246)
[11:48:42] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1236* gradually with 4 steps - Work done
[11:54:10] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P83122 and previous config saved to /var/cache/conftool/dbconfig/20250910-115409-ladsgroup.json
[11:56:12] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1173 gradually with 4 steps - Upgrade of db1173.eqiad.wmnet completed
[11:56:12] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet
[11:57:16] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11166996 (MoritzMuehlenhoff)
[11:59:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:04] <jouncebot>	 Dreamy_Jazz: May I have your attention please! Deployment of CheckUser Suggested Investigations signals. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200)
[12:00:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83125 and previous config saved to /var/cache/conftool/dbconfig/20250910-120024-fceratto.json
[12:00:29] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[12:01:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:02:00] <wikibugs>	 (CR) Federico Ceratto: [C:+2] upgrade.py: Restart Prometheus exporter [cookbooks] - https://gerrit.wikimedia.org/r/1186532 (owner: Federico Ceratto)
[12:04:50] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.969 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:05:43] <wikibugs>	 (CR) Stevemunene: [C:+2] dse-k8s: Augment the dse-k8s cluster namespaces. [deployment-charts] - https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) (owner: Stevemunene)
[12:06:32] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.541 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:07:15] <wikibugs>	 (PS1) Muehlenhoff: Setup maps1011 as master node for new maps/eqiad servers [puppet] - https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565)
[12:08:32] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[12:08:58] <wikibugs>	 (CR) KartikMistry: [C:+2] Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730) (owner: KartikMistry)
[12:09:18] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83127 and previous config saved to /var/cache/conftool/dbconfig/20250910-120917-ladsgroup.json
[12:09:22] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[12:09:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:33] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[12:09:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83128 and previous config saved to /var/cache/conftool/dbconfig/20250910-120940-ladsgroup.json
[12:12:49] <wikibugs>	 (PS2) Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958
[12:13:22] <wikibugs>	 (Merged) jenkins-bot: dse-k8s: Augment the dse-k8s cluster namespaces. [deployment-charts] - https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) (owner: Stevemunene)
[12:13:57] <wikibugs>	 (PS3) Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958
[12:14:02] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958 (owner: Ayounsi)
[12:16:31] <wikibugs>	 (Merged) jenkins-bot: Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730) (owner: KartikMistry)
[12:17:11] <wikibugs>	 (PS2) Sbisson: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886)
[12:19:19] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[12:20:25] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[12:21:09] <wikibugs>	 (CR) Ayounsi: [WIP] Analytics and loopback ACLs (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1186958 (owner: Ayounsi)
[12:21:38] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:23:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404104#11167076 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[12:25:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[12:25:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167096 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[12:26:04] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:29:44] <wikibugs>	 (CR) Vgutierrez: [C:+1] "looking good 😄" [puppet] - https://gerrit.wikimedia.org/r/1186935 (owner: Slyngshede)
[12:30:11] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83131 and previous config saved to /var/cache/conftool/dbconfig/20250910-123011-ladsgroup.json
[12:30:16] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[12:30:24] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2191 gradually with 4 steps - Upgrade of db2191.codfw.wmnet completed
[12:30:25] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet
[12:31:18] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:33:51] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:34:10] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1236* gradually with 4 steps - Work done
[12:34:38] <Dreamy_Jazz>	 I've finished with my window and so feel free to deploy etc. now
[12:36:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:37:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: Maintenance
[12:37:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83134 and previous config saved to /var/cache/conftool/dbconfig/20250910-123719-fceratto.json
[12:37:24] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[12:37:45] <wikibugs>	 (CR) Cathal Mooney: [WIP] Analytics and loopback ACLs (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1186958 (owner: Ayounsi)
[12:38:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83135 and previous config saved to /var/cache/conftool/dbconfig/20250910-123831-fceratto.json
[12:38:45] <wikibugs>	 (PS1) Btullis: Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - https://gerrit.wikimedia.org/r/1186974
[12:39:40] <Tchanders>	 Dreamy_Jazz: Thanks, I'll go ahead with the temporary accounts deployment, scheduled for the afternoon backport window
[12:39:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:42:15] <wikibugs>	 (PS2) Btullis: Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - https://gerrit.wikimedia.org/r/1186974 (https://phabricator.wikimedia.org/T399779)
[12:42:24] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - https://gerrit.wikimedia.org/r/1186974 (https://phabricator.wikimedia.org/T399779) (owner: Btullis)
[12:44:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: STran)
[12:44:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) (owner: STran)
[12:44:50] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:44:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:45:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P83136 and previous config saved to /var/cache/conftool/dbconfig/20250910-124518-ladsgroup.json
[12:45:25] <wikibugs>	 (PS1) Ladsgroup: db2185: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371)
[12:45:39] <wikibugs>	 (Merged) jenkins-bot: Enable temporary accounts on all medium-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: STran)
[12:45:43] <wikibugs>	 (Merged) jenkins-bot: Enable temporary accounts on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) (owner: STran)
[12:46:07] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:46:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167133 (BTullis) We decided that it's better to switch this back to legacy boot mode, since all of the other dse-k8s-workers are still using that. If we want to switch...
[12:46:28] <logmsgbot>	 !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]]
[12:46:33] <stashbot>	 T403399: Deploy Temporary accounts to all medium-sized projects - https://phabricator.wikimedia.org/T403399
[12:46:34] <stashbot>	 T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181
[12:46:43] <wikibugs>	 (PS5) Jforrester: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: Cory Massaro)
[12:46:48] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2185.codfw.wmnet with reason: Glow up (T394371)
[12:46:54] <stashbot>	 T394371: Migrate to MariaDB 10.11 - https://phabricator.wikimedia.org/T394371
[12:46:54] <wikibugs>	 (PS6) Jforrester: wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: Cory Massaro)
[12:47:06] <stephanebisson>	 jouncebot now
[12:47:06] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): Deployment of CheckUser Suggested Investigations signals (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200)
[12:47:49] <wikibugs>	 (PS2) Ladsgroup: db2185: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371)
[12:47:54] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] db2185: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371) (owner: Ladsgroup)
[12:48:03] <moritzm>	 !log installing unbound security updates on bullseyre
[12:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:05] <moritzm>	 !log installing unbound security updates on bullseye
[12:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:32] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941)
[12:50:43] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061)
[12:50:51] <wikibugs>	 (PS1) Jforrester: wikifunctions: Pre-emptively disable Wikidata reference fetching [deployment-charts] - https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425)
[12:50:52] <logmsgbot>	 !log tchanders@deploy1003 tchanders, stran: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:50:54] <wikibugs>	 (PS1) Jforrester: wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956)
[12:50:58] <wikibugs>	 (PS1) Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one day [deployment-charts] - https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956)
[12:51:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance
[12:51:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83137 and previous config saved to /var/cache/conftool/dbconfig/20250910-125108-fceratto.json
[12:51:12] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[12:52:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2203', diff saved to https://phabricator.wikimedia.org/P83138 and previous config saved to /var/cache/conftool/dbconfig/20250910-125216-fceratto.json
[12:52:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83139 and previous config saved to /var/cache/conftool/dbconfig/20250910-125224-fceratto.json
[12:53:32] <wikibugs>	 (PS1) Ladsgroup: db1215: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371)
[12:55:02] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1215.eqiad.wmnet with reason: Glow up (T399540 T394371)
[12:55:08] <stashbot>	 T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[12:55:08] <stashbot>	 T394371: Migrate to MariaDB 10.11 - https://phabricator.wikimedia.org/T394371
[12:57:13] <logmsgbot>	 !log tchanders@deploy1003 tchanders, stran: Continuing with sync
[12:58:02] <Tchanders>	 Tested that the account is attached to newly enabled wikis, and not attached to disabled wikis, so going ahead
[12:58:17] <wikibugs>	 (PS2) Ladsgroup: db1215: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371)
[12:58:30] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] db1215: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371) (owner: Ladsgroup)
[12:59:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11167203 (ayounsi)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1300).
[13:00:05] <jouncebot>	 stephanebisson and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:24] <Tchanders>	 o/
[13:00:31] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250910-130026-ladsgroup.json
[13:00:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:41] <stephanebisson>	 o/
[13:00:52] <Tchanders>	 The previous window finished early so I got started on my deployment - it's just syncing out now
[13:01:27] <stephanebisson>	 Great! I'll go when you're done
[13:01:34] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:02:35] <logmsgbot>	 !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]] (duration: 16m 06s)
[13:02:41] <stashbot>	 T403399: Deploy Temporary accounts to all medium-sized projects - https://phabricator.wikimedia.org/T403399
[13:02:41] <stashbot>	 T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181
[13:04:50] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.923 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:05:10] <kart_>	 !log Updated Recommendation API to 2025-09-10-080042-production (T403730, T403976, T400562)
[13:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:17] <stashbot>	 T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730
[13:05:18] <stashbot>	 T403976: Section suggestions: Appendix sections should not be considered as valid suggestions for sections - https://phabricator.wikimedia.org/T403976
[13:05:18] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:05:18] <stashbot>	 T400562: Create a unified Logstash dashboard displaying errors from cx, cxserver, RecommentationAPI, MinT - https://phabricator.wikimedia.org/T400562
[13:05:20] <kart_>	 Forgot to log this earlier ^^
[13:05:36] <kart_>	 stephanebisson: I'm around now if you need any help.
[13:05:53] <wikibugs>	 (PS1) Ayounsi: Handle nokia interface name style [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985
[13:06:26] <stephanebisson>	 kart_ do you want to drive the deployment?
[13:07:05] <wikibugs>	 (PS2) Ayounsi: Handle nokia interface name style [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146)
[13:07:10] <kart_>	 stephanebisson: sure.
[13:07:22] <wikibugs>	 (CR) Ladsgroup: "\o/" [mediawiki-config] - https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: STran)
[13:07:35] <kart_>	 stephanebisson: let's start with first patch?
[13:08:26] <stephanebisson>	 kart_ let's start with the "Desktop publish_success:..."
[13:08:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167262 (Jclark-ctr) @elukey Since we’re switching back to Legacy boot mode, I attempted to provision again, but it failed. When I went to manually change the BIOS, the...
[13:08:48] <wikibugs>	 (PS3) Ayounsi: Handle nokia interface name style [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146)
[13:08:58] <kart_>	 stephanebisson: OK
[13:09:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: Sbisson)
[13:15:25] <Lucas_WMDE>	 o/
[13:15:38] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83141 and previous config saved to /var/cache/conftool/dbconfig/20250910-131538-ladsgroup.json
[13:15:43] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[13:15:47] <wikibugs>	 (PS4) Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - https://gerrit.wikimedia.org/r/1186958
[13:15:54] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[13:17:03] <wikibugs>	 (PS1) Muehlenhoff: Enable the regular imports of the OSM updates and water lines on maps2011 [puppet] - https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565)
[13:17:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2212', diff saved to https://phabricator.wikimedia.org/P83142 and previous config saved to /var/cache/conftool/dbconfig/20250910-131728-fceratto.json
[13:17:50] <moritzm>	 !log installing apache2 security updates
[13:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:03] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:18:20] <icinga-wm>	 PROBLEM - BFD status on ssw1-f1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:19:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between ssw1-f1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:20:24] <wikibugs>	 (Merged) jenkins-bot: Desktop publish_success: add revid and pageid [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: Sbisson)
[13:20:51] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]]
[13:20:55] <stashbot>	 T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975
[13:21:43] <wikibugs>	 (PS11) Ayounsi: Use Homer to configure the network [cookbooks] - https://gerrit.wikimedia.org/r/1166407
[13:21:51] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186988 (https://phabricator.wikimedia.org/T404192)
[13:22:10] <wikibugs>	 (CR) Ayounsi: Use Homer to configure the network (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1166407 (owner: Ayounsi)
[13:22:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P83143 and previous config saved to /var/cache/conftool/dbconfig/20250910-132239-fceratto.json
[13:23:03] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T404192
[13:23:07] <stashbot>	 T404192: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T404192
[13:23:20] <icinga-wm>	 PROBLEM - BFD status on ssw1-e1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:24:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between ssw1-e1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:24:48] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1233-1236].eqiad.wmnet
[13:26:44] <wikibugs>	 (PS13) Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935
[13:26:55] <logmsgbot>	 !log kartik@deploy1003 kartik, sbisson: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:26:59] <stashbot>	 T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975
[13:27:22] <kart_>	 stephanebisson: you can test the patch.
[13:27:28] <stephanebisson>	 on it
[13:29:06] <wikibugs>	 (CR) Elukey: [C:+1] Setup maps1011 as master node for new maps/eqiad servers [puppet] - https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:29:22] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1233-1236].eqiad.wmnet
[13:30:02] <wikibugs>	 (CR) Federico Ceratto: [C:+2] mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1186988 (https://phabricator.wikimedia.org/T404192) (owner: Gerrit maintenance bot)
[13:30:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:28] <stephanebisson>	 Kart_ All good, go ahead
[13:30:45] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1235.eqiad.wmnet
[13:31:05] <kart_>	 stephanebisson: sure
[13:31:11] <logmsgbot>	 !log kartik@deploy1003 kartik, sbisson: Continuing with sync
[13:31:33] <wikibugs>	 (CR) Slyngshede: P:cache::haproxy unittests for Lua module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1186935 (owner: Slyngshede)
[13:31:34] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:31:39] <federico3>	 !log Starting s8 codfw failover from db2161 to db2165 - T404192
[13:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:42] <stashbot>	 T404192: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T404192
[13:31:46] <wikibugs>	 (CR) Bking: [C:+2] "Looks great, thank you for helping out here!" [deployment-charts] - https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: Btullis)
[13:32:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T404192', diff saved to https://phabricator.wikimedia.org/P83145 and previous config saved to /var/cache/conftool/dbconfig/20250910-133231-fceratto.json
[13:32:36] <kart_>	 stephanebisson: I'll also +2 in advance for CX build patch
[13:33:01] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie
[13:33:35] <wikibugs>	 (CR) Slyngshede: [C:+2] P:cache::haproxy unittests for Lua module [puppet] - https://gerrit.wikimedia.org/r/1186935 (owner: Slyngshede)
[13:33:44] <kart_>	 stephanebisson: Is that patch updated? I saw some updates in master.
[13:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:34:40] <stephanebisson>	 Kart_ I updated the branch too but I did it manually because the version in master was also rebased and that included unwanted changes
[13:34:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:35:22] <kart_>	 stephanebisson: OK. Let's +2.
[13:35:45] <wikibugs>	 (CR) KartikMistry: [C:+2] CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: Sbisson)
[13:36:32] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]] (duration: 15m 41s)
[13:36:37] <stashbot>	 T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975
[13:36:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: Sbisson)
[13:37:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:37:22] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance
[13:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83146 and previous config saved to /var/cache/conftool/dbconfig/20250910-133729-ladsgroup.json
[13:37:34] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[13:37:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83147 and previous config saved to /var/cache/conftool/dbconfig/20250910-133746-fceratto.json
[13:37:51] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[13:37:53] <denisse>	 !incidents
[13:37:53] <sirenbot>	 6723 (RESOLVED)  db1236 (paged)/MariaDB Replica SQL: s7 (paged)
[13:37:53] <sirenbot>	 6724 (RESOLVED)  db1236 (paged)/MariaDB Replica IO: s7 (paged)
[13:37:53] <sirenbot>	 6725 (RESOLVED)  db1236 (paged)/mysqld processes (paged)
[13:38:53] <wikibugs>	 (Merged) jenkins-bot: Install the opensearch-operator-crd chart to the dse-k8s clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: Btullis)
[13:40:39] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: Maintenance
[13:40:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T402763)', diff saved to https://phabricator.wikimedia.org/P83148 and previous config saved to /var/cache/conftool/dbconfig/20250910-134046-fceratto.json
[13:42:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.281s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:42:19] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie
[13:42:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:30] <wikibugs>	 (CR) Btullis: "Yes, I think that the 'templates' path element is required. I didn't spot it at first, so when I tried a `rake run_locally` with only the " [deployment-charts] - https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: Btullis)
[13:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:53] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:46:20] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:47:29] <wikibugs>	 (Merged) jenkins-bot: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: Sbisson)
[13:47:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T402763)', diff saved to https://phabricator.wikimedia.org/P83149 and previous config saved to /var/cache/conftool/dbconfig/20250910-134734-fceratto.json
[13:47:39] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[13:47:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:56] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 T399125 T399133 T403730 T404045 T404093)]]
[13:48:13] <wikibugs>	 (Abandoned) Brouberol: runner: redact mysql password from the command string when reporting an error [dumps] - https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162) (owner: Brouberol)
[13:48:16] <stashbot>	 T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886
[13:48:16] <stashbot>	 T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998
[13:48:16] <stashbot>	 T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122
[13:48:17] <stashbot>	 T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125
[13:48:17] <stashbot>	 T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133
[13:48:17] <stashbot>	 T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730
[13:48:18] <stashbot>	 T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045
[13:48:18] <stashbot>	 T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093
[13:48:23] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1235.eqiad.wmnet
[13:48:31] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:49:44] <wikibugs>	 SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11167514 (elukey) ==== Error Budget calculations ====  The grafana dashboard's JSON shows these  two expressions:  ` "description": "This graph shows the month error budget burn down chart (starts the 1st...
[13:49:52] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.543 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:50:03] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:51:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2161 weight based on db2165', diff saved to https://phabricator.wikimedia.org/P83150 and previous config saved to /var/cache/conftool/dbconfig/20250910-135119-fceratto.json
[13:53:51] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:54:25] <logmsgbot>	 !log kartik@deploy1003 sbisson, kartik: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 T399125 T399133 T403730 T404045 T404093)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:54:40] <stashbot>	 T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886
[13:54:40] <stashbot>	 T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998
[13:54:40] <stashbot>	 T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122
[13:54:41] <stashbot>	 T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125
[13:54:41] <stashbot>	 T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133
[13:54:42] <stashbot>	 T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730
[13:54:42] <stashbot>	 T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045
[13:54:43] <stashbot>	 T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093
[13:55:26] <kart_>	 stephanebisson: ready for testing!
[13:55:31] <stephanebisson>	 on it
[13:55:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Reset weights on db2212 and db2203', diff saved to https://phabricator.wikimedia.org/P83151 and previous config saved to /var/cache/conftool/dbconfig/20250910-135553-fceratto.json
[13:56:40] <wikibugs>	 (CR) Scott French: "Many thanks for the review, Valentin." [puppet] - https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: Scott French)
[13:56:46] <wikibugs>	 (CR) Scott French: "+cc @cgoubert@wikimedia.org FYI, since this will conflict structurally, but not functionally, with Ibe367b528408886f34748e1b935b192a6d8c33" [puppet] - https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) (owner: Scott French)
[13:57:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2204', diff saved to https://phabricator.wikimedia.org/P83152 and previous config saved to /var/cache/conftool/dbconfig/20250910-135720-fceratto.json
[13:57:29] <wikibugs>	 (CR) Volans: "Great work, thanks a lot to have kept iterating on it! I've left some questions, couple of possible small issues and few nits inline. I th" [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[13:57:58] <wikibugs>	 SRE-SLO, Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11167580 (herron) p:Triage→Medium
[13:58:23] <wikibugs>	 SRE-SLO, Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11167591 (herron) Open→Resolved Tonecheck metrics have been backfilled with a clean history
[13:58:40] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Temp bump to 6 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412)
[14:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1400)
[14:00:22] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941) (owner: Jforrester)
[14:01:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83153 and previous config saved to /var/cache/conftool/dbconfig/20250910-140147-ladsgroup.json
[14:01:52] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance
[14:01:53] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[14:01:59] <kart_>	 James_F: We're still deploying. I'll ping once done.
[14:02:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83154 and previous config saved to /var/cache/conftool/dbconfig/20250910-140159-fceratto.json
[14:02:04] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:02:07] <James_F>	 kart_: It's fine, services not MW land.
[14:02:16] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thank you! While ideally this probably wouldn't be needed, it would be nice not to have to think about it while making the DNS changes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert)
[14:02:19] <kart_>	 Thanks
[14:02:19] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941) (owner: 10Jforrester)
[14:02:43] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Temp bump to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert)
[14:03:49] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:03:51] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:04:07] <wikibugs>	 (03PS1) 10Btullis: Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438)
[14:04:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83155 and previous config saved to /var/cache/conftool/dbconfig/20250910-140410-fceratto.json
[14:04:18] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Temp bump to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert)
[14:04:19] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:04:24] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] P:trafficserver::backend: add mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French)
[14:04:30] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:04:31] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:04:35] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:04:42] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:04:51] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:04:56] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:05:02] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:05:35] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:05:39] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:06:24] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:06:28] <kart_>	 cool
[14:06:28] <stephanebisson>	 kart_ you can go ahead
[14:06:32] <logmsgbot>	 !log kartik@deploy1003 sbisson, kartik: Continuing with sync
[14:06:42] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061) (owner: 10Jforrester)
[14:07:08] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hieradata: add mw-next-routing to ATS tslua plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French)
[14:08:30] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061) (owner: 10Jforrester)
[14:08:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[14:08:36] <wikibugs>	 (03CR) 10Ssingh: wmnet: Introduce rest-gateway-ro (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert)
[14:09:10] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[14:09:53] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:10:20] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:10:39] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:11:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:11:21] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:12:05] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 T399125 T399133 T403730 T404045 T404093)]] (duration: 24m 08s)
[14:12:19] <stashbot>	 T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886
[14:12:20] <stashbot>	 T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998
[14:12:20] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:12:20] <stashbot>	 T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122
[14:12:20] <stashbot>	 T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125
[14:12:21] <stashbot>	 T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133
[14:12:22] <stashbot>	 T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730
[14:12:22] <stashbot>	 T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045
[14:12:23] <stashbot>	 T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093
[14:12:50] <wikibugs>	 (03PS1) 10Andrew Bogott: eqiad1 cloudcontrols => ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187000
[14:13:37] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro)
[14:13:37] <kart_>	 stephanebisson86: all done
[14:13:55] <stephanebisson86>	 kart_ thanks!
[14:14:46] <wikibugs>	 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11167669 (10elukey) ==== Alerting ====  Adding some alerting rules adds the following:  ` - name: sloth-slo-alerts-citoid-requests-availability   rules:   - alert: CitoidHighErrorRate     expr: |       (...
[14:14:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] eqiad1 cloudcontrols => ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187000 (owner: 10Andrew Bogott)
[14:16:06] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro)
[14:16:32] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P83156 and previous config saved to /var/cache/conftool/dbconfig/20250910-141655-ladsgroup.json
[14:17:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:17:38] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:18:11] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[14:19:16] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:19:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P83157 and previous config saved to /var/cache/conftool/dbconfig/20250910-141917-fceratto.json
[14:19:41] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:19:55] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:20:10] <James_F>	 kart_: Are you complete at your end?
[14:20:30] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:20:39] <wikibugs>	 (03CR) 10Btullis: [C:03+1] druid: Bring druid1012.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182698 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene)
[14:20:44] <wikibugs>	 (03CR) 10Btullis: [C:03+1] druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene)
[14:20:56] <wikibugs>	 (03CR) 10Btullis: [C:03+1] druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene)
[14:20:58] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Pre-emptively disable Wikidata reference fetching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425) (owner: 10Jforrester)
[14:21:05] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester)
[14:21:17] <wikibugs>	 (03CR) 10Btullis: [C:03+1] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[14:22:48] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Pre-emptively disable Wikidata reference fetching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425) (owner: 10Jforrester)
[14:22:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:59] <wikibugs>	 (03PS4) 10JHathaway: provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619
[14:23:03] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester)
[14:24:31] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:24:48] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:25:11] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:25:32] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:25:37] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:26:00] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:26:15] <kart_>	 James_F: sorry. Yes. Done.
[14:26:22] <James_F>	 kart_: Awesome, no worries.
[14:26:39] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186953 (owner: 10PipelineBot)
[14:27:15] <wikibugs>	 (03PS1) 10Scott French: shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284)
[14:27:53] <wikibugs>	 (03PS1) 10Andrew Bogott: eqiad1: fix hiera key for moving cloudcontrols to ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187004
[14:27:59] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187004 (owner: 10Andrew Bogott)
[14:28:28] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186953 (owner: 10PipelineBot)
[14:28:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:20] <icinga-wm>	 RECOVERY - BFD status on ssw1-f1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:29:20] <icinga-wm>	 RECOVERY - BFD status on ssw1-e1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:30:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] eqiad1: fix hiera key for moving cloudcontrols to ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187004 (owner: 10Andrew Bogott)
[14:30:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1400)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1430)
[14:30:09] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:30:30] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:31:18] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956)
[14:31:26] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester)
[14:31:32] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.666 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:32:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P83158 and previous config saved to /var/cache/conftool/dbconfig/20250910-143202-ladsgroup.json
[14:32:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:33:09] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester)
[14:33:40] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:34:02] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:34:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between ssw1-e1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:34:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P83159 and previous config saved to /var/cache/conftool/dbconfig/20250910-143424-fceratto.json
[14:34:29] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:34:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:34:52] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:35:13] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:35:26] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:36:01] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:36:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186517 (owner: 10Jforrester)
[14:36:23] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:37:25] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:trafficserver::backend: add mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French)
[14:38:09] <wikibugs>	 06SRE, 10DNS, 06FR-donorrelations, 06Traffic: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278#11167815 (10ssingh) Hi @EBrill-WMF: Happy to set up a time to talk about this; please let me know and I can set that up. Thanks.
[14:39:21] <wikibugs>	 (03PS5) 10Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958
[14:39:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.484 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:40:14] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[14:40:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Setup maps1011 as master node for new maps/eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:41:08] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 (owner: 10JHathaway)
[14:41:17] <wikibugs>	 (03Merged) 10jenkins-bot: Improve performance of preferred labels subquery [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186517 (owner: 10Jforrester)
[14:41:46] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]]
[14:43:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: flip debdeploy::client::ensure to present [puppet] - 10https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi)
[14:43:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: exclude nfs/nfs4 from debdeploy::client in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi)
[14:45:37] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:46:25] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:46:29] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956)
[14:47:11] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83160 and previous config saved to /var/cache/conftool/dbconfig/20250910-144710-ladsgroup.json
[14:47:14] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[14:47:25] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance
[14:47:33] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83161 and previous config saved to /var/cache/conftool/dbconfig/20250910-144732-ladsgroup.json
[14:47:39] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:48:38] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Change all druid_public hosts references to use svc url [puppet] - 10https://gerrit.wikimedia.org/r/1185922 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene)
[14:49:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83162 and previous config saved to /var/cache/conftool/dbconfig/20250910-144932-fceratto.json
[14:49:37] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:49:40] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:50:08] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[14:51:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:53:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.288s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:54:05] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): Audit Moderator Tools applications for use of the "m." sub-domain - https://phabricator.wikimedia.org/T404207 (Kgraessle) NEW
[14:54:48] <wikibugs>	 (03CR) 10Jforrester: [C:04-1] "Still investigating." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester)
[14:55:30] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]] (duration: 13m 44s)
[14:55:34] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=rest-gateway,name=codfw [reason: Depooling codfw ahead of switch to active-passive - T400131]
[14:55:38] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[14:55:48] <James_F>	 OK, all done here.
[14:56:09] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap::master: Update advise in /srv/patches git pre-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672)
[14:56:57] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2212 gradually with 4 steps - pooling in
[14:57:01] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2212 gradually with 4 steps - pooling in
[14:57:31] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:58:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.67s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:59:27] <wikibugs>	 (03PS1) 10Muehlenhoff: autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012
[15:02:29] <wikibugs>	 (03PS2) 10Muehlenhoff: autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012
[15:03:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012 (owner: 10Muehlenhoff)
[15:05:19] <wikibugs>	 (03CR) 10SBassett: [C:03+1] "Not sure if all secteam folks are sudoers on deployment though..." [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[15:06:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:08:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2204 and db2207 weights after flip', diff saved to https://phabricator.wikimedia.org/P83163 and previous config saved to /var/cache/conftool/dbconfig/20250910-150800-fceratto.json
[15:08:27] <wikibugs>	 (03CR) 10Ahmon Dancy: "ooh, good point.  Can you run `sudo -l` on the deploy server and see if fix-staging-perms is listed?" [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[15:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:08:58] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:50] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[15:10:57] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie
[15:11:44] <wikibugs>	 (03CR) 10SBassett: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[15:12:18] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83164 and previous config saved to /var/cache/conftool/dbconfig/20250910-151216-ladsgroup.json
[15:12:22] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[15:13:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167999 (10elukey) @Jclark-ctr I synced with Jesse and he didn't make anything during yesterday's tests that could end up in this situation, so maybe a reset to factory defaults could help to s...
[15:14:17] <wikibugs>	 (03PS1) 10Xcollazo: dumps: disable rsync access for 2 dead dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987)
[15:16:31] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway,name=codfw [reason: Repooling codfw while investigating provisioning of proton service - T400131]
[15:16:36] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[15:18:54] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[15:19:02] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[15:22:23] <papaul>	 !log disable OSPF on mr1-eqsin to test BGP 
[15:22:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:34] <logmsgbot>	 !log pt1979@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqsin with reason: router upgrade
[15:22:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11168027 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=eb9214cb-2708-4519-b3c6-38d4f0f7cf7d) set by pt1979@cumin1002 for 1:00:00 o...
[15:23:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Create component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187019 (https://phabricator.wikimedia.org/T404114)
[15:23:58] <wikibugs>	 (03PS1) 10Muehlenhoff: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114)
[15:26:34] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:26:46] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[15:26:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:27:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:27:19] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:27:26] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P83165 and previous config saved to /var/cache/conftool/dbconfig/20250910-152725-ladsgroup.json
[15:27:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Trove: install backdoor VM keys on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: 10Andrew Bogott)
[15:27:48] <wikibugs>	 (03PS1) 10Scott French: admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131)
[15:28:07] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:28:10] <rzl>	 jouncebot: nowandnext
[15:28:10] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 31 minute(s)
[15:28:10] <jouncebot>	 In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1700)
[15:28:17] <wikibugs>	 06SRE, 10bacula, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11168066 (10fgiunchedi) +SRE for visibility
[15:28:34] <rzl>	 borrowing mw-debug in codfw for a quick experiment, holler if anyone needs to deploy anything :)
[15:29:16] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:29:25] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Fine to go now" [puppet] - 10https://gerrit.wikimedia.org/r/1187019 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:30:08] <Lucas_WMDE>	 rzl: mw-debug or mw-experimental? :) https://wikitech.wikimedia.org/wiki/Mw-experimental
[15:30:29] <Lucas_WMDE>	 (I don’t remember if you were involved in that so mentioning it just in case 😇)
[15:30:39] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022
[15:30:45] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:31:23] <rzl>	 Lucas_WMDE: debug! testing a helmfile change not a MW code change so not experimental's wheelhouse but I appreciate it anyway :)
[15:31:28] <wikibugs>	 (03CR) 10Jcrespo: "This is ok as it is, but, as I mentioned on the ticket, can we move the decision to the profile, as the logic will change in the future? I" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:31:30] <Lucas_WMDE>	 ok :)
[15:32:07] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022 (owner: 10Clare Ming)
[15:33:13] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:33:49] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022 (owner: 10Clare Ming)
[15:33:58] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[15:33:58] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:07] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul)
[15:34:11] <wikibugs>	 (03CR) 10Jcrespo: "Actually, I have an additional question- to revert it, what actions are needed? will removing the repo setup be enough for a clean upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:34:21] <wikibugs>	 (03CR) 10Muehlenhoff: "Not sure we even need to make ot configurable? It will be needed for all Trixie nodes and once the server side is updated it will be neede" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:35:09] <logmsgbot>	 btullis@cumin1003 roll-restart-masters (PID 1963407) is awaiting input
[15:35:23] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:35:25] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:36:30] <wikibugs>	 (03CR) 10Muehlenhoff: "Yes, once the server side is compatible, we:" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:36:32] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:38:15] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:38:57] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French)
[15:39:08] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French)
[15:40:16] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[15:40:42] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:41:12] <rzl>	 done with mw-debug for now, might go again before the 17:00 infra window, remains to be seen
[15:41:57] <wikibugs>	 (03PS1) 10Papaul: Fix typo on ospf3 [homer/public] - 10https://gerrit.wikimedia.org/r/1187024 (https://phabricator.wikimedia.org/T294845)
[15:43:31] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Fix typo on ospf3 [homer/public] - 10https://gerrit.wikimedia.org/r/1187024 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul)
[15:43:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Swap db2213 and 2223 weights', diff saved to https://phabricator.wikimedia.org/P83166 and previous config saved to /var/cache/conftool/dbconfig/20250910-154331-fceratto.json
[15:43:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P83167 and previous config saved to /var/cache/conftool/dbconfig/20250910-154340-ladsgroup.json
[15:44:36] <wikibugs>	 (03CR) 10Jcrespo: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:44:38] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French)
[15:44:42] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French)
[15:49:31] <rzl>	 borrowing mw-debug codfw again after all, don't mind me
[15:49:38] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[15:49:48] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:50:10] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:50:11] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[15:51:46] <wikibugs>	 (03PS2) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:51:59] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French)
[15:53:26] <wikibugs>	 (03CR) 10Jcrespo: "It is almost the same thing, but I want to protect against myself: 1) on server upgrade, 2) on cloud services, 3) on next os upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:53:34] <wikibugs>	 (CR) Elukey: Introduce v1 xLab / MPIC SLOs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[15:54:30] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie
[15:54:47] <wikibugs>	 (03PS3) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:55:39] <wikibugs>	 (03PS4) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:56:32] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225)
[15:56:50] <wikibugs>	 (03CR) 10Jcrespo: "The idea is, when testing 15 on the storage server, I will be able to upgrade one host at a time, hence the hiera key." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:57:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:57:48] <wikibugs>	 (03Abandoned) 10Clare Ming: xLab: Deploy v1.0.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186616 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming)
[15:58:02] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:58:39] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming)
[15:58:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83168 and previous config saved to /var/cache/conftool/dbconfig/20250910-155847-ladsgroup.json
[15:58:52] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[15:59:02] <wikibugs>	 (03CR) 10Jcrespo: "I wouldn't mind an extra pair of eyes, I am a bit distracted and making mistakes." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[15:59:04] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:59:12] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83169 and previous config saved to /var/cache/conftool/dbconfig/20250910-155911-ladsgroup.json
[15:59:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:00:13] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming)
[16:00:35] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:01:47] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:02:07] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:03:05] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[16:03:25] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[16:04:09] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:04:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:05:04] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:05:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[16:05:51] <wikibugs>	 (03PS1) 10Btullis: Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438)
[16:06:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03)
[16:06:57] <wikibugs>	 (03PS5) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:07:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:10:12] <wikibugs>	 (03PS6) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:11:50] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=rest-gateway,name=codfw [reason: Depooling codfw ahead of switch to active-passive - T400131]
[16:11:54] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[16:12:32] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[16:12:49] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:14:33] <wikibugs>	 (CR) DCausse: [C:+1] cirrus: Reduce galleries weight in search on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[16:15:02] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "This is ok to me, but I am not saying it has to be like this, feel free to critizise it further." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:16:30] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "This will allow me to do hieradata/hosts/people1005.yaml: profile::backup::client_version: 15 # individually for testing." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff)
[16:16:35] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:17:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168370 (10Jhancock.wm)
[16:18:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11168372 (Papaul) Open→Resolved  mr1-eqsin and cr2/3-eqsin are now running BGP for the management network. Resolving this task. Thanks @ay...
[16:18:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168374 (10Jhancock.wm)
[16:21:25] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:23:23] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[16:23:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168432 (10Jhancock.wm) a:03Papaul
[16:24:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168433 (10Jhancock.wm) a:03Papaul
[16:24:22] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83170 and previous config saved to /var/cache/conftool/dbconfig/20250910-162421-ladsgroup.json
[16:24:26] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[16:24:39] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11168439 (10Ottomata) @JAllemandou can you weigh in here?  @CDanis def possible to reuse the logic in Java, but it would probably be a la...
[16:26:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[16:27:40] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[16:28:48] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[16:39:29] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P83171 and previous config saved to /var/cache/conftool/dbconfig/20250910-163929-ladsgroup.json
[16:40:18] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[16:40:53] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks Scott. I looked at a sampling of other security team users and they're all in the `deployment` group which is what grants this priv" [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[16:42:00] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[16:42:46] <wikibugs>	 (03PS7) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[16:43:46] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[16:43:50] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[16:44:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[16:44:21] <wikibugs>	 (03PS1) 10CDanis: otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211)
[16:45:03] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[16:45:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[16:46:13] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) (owner: 10CDanis)
[16:46:13] <wikibugs>	 (03PS8) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[16:46:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[16:46:44] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[16:46:56] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[16:47:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[16:48:48] <wikibugs>	 (03PS9) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[16:48:49] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[16:48:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[16:49:00] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) (owner: 10CDanis)
[16:50:28] <wikibugs>	 (03CR) 10Bking: [C:03+1] Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[16:50:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11168675 (10Jclark-ctr) @elukey I continued to get errors no root file system is defined when trying to boot from uefi
[16:54:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[16:54:24] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm
[16:54:37] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P83173 and previous config saved to /var/cache/conftool/dbconfig/20250910-165436-ladsgroup.json
[16:55:06] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[16:55:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[16:55:20] <wikibugs>	 (03PS10) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[16:55:32] <swfrench-wmf>	 !log started single-replica PHP 8.3 pilot on shellbox-constraints - T403284
[16:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:36] <stashbot>	 T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284
[16:56:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[16:57:47] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) (owner: 10CDanis)
[16:58:39] <logmsgbot>	 !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:59:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11168738 (10bking) @BTullis @Jclark-ctr   Per @elukey 's comment, I'd also like to express my preference for using UEFI-DC Ops and IF are working to support the SuperMicro platform exclusively o...
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1700)
[17:00:25] <wikibugs>	 (03PS11) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577)
[17:00:26] <logmsgbot>	 !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[17:04:58] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:09:45] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83174 and previous config saved to /var/cache/conftool/dbconfig/20250910-170944-ladsgroup.json
[17:09:48] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:09:49] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[17:10:00] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2199.codfw.wmnet with reason: Maintenance
[17:10:11] <wikibugs>	 (03PS2) 10Anzx: Lift IP cap for workshop at University of Pretoria on 29-30 September [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218)
[17:10:13] <logmsgbot>	 !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[17:10:20] <logmsgbot>	 !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[17:10:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1186609 (owner: 10Dzahn)
[17:11:26] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[17:12:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] scap::master: Update advise in /srv/patches git pre-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[17:13:30] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[17:13:39] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[17:15:52] <Amir1>	 !log dropping user_autocreate_serial on sul wikis where empty (T397367)
[17:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:56] <stashbot>	 T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[17:19:22] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[17:23:01] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1233-1236].eqiad.wmnet
[17:24:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11168866 (10KFrancis) Hi all, I am confirming the NDA is complete.  Thanks!
[17:26:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11168885 (10KFrancis) Hi all, I am confirming the NDA is complete.  Thanks!
[17:28:09] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2206.codfw.wmnet with reason: Maintenance
[17:28:17] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83175 and previous config saved to /var/cache/conftool/dbconfig/20250910-172817-ladsgroup.json
[17:28:23] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[17:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:34:11] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db2185.codfw.wmnet
[17:34:54] <wikibugs>	 SRE, envoy, serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168927 (RLazarus) And by request from @CDanis, adding to this config update cycle:  If `mesh.tracing.service_name` is set in values, pass it through in the `service_name` field of the `OpenTelemetryC...
[17:35:20] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[17:36:22] <wikibugs>	 SRE, envoy, serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168928 (RLazarus)
[17:36:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168929 (Papaul) Open→Resolved Done on the switch side
[17:37:15] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168933 (Papaul) Done on the switch side
[17:37:25] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm
[17:37:37] <wikibugs>	 SRE, envoy, serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168934 (CDanis) Oh, one other thing we might want to do, if mesh.tracing.service_name is unset, default it to `.Release.Namespace`.  This is effectively what happens anyway in the big mess between En...
[17:39:58] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2185.codfw.wmnet
[17:40:47] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1256.eqiad.wmnet
[17:41:05] <logmsgbot>	 btullis@cumin1003 decommission (PID 1983368) is awaiting input
[17:41:07] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1256 - Upgrading db1256.eqiad.wmnet
[17:41:34] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1256 - Upgrading db1256.eqiad.wmnet
[17:41:40] <wikibugs>	 (PS1) CDanis: turnilo: re-add summed-up TTFB measure [puppet] - https://gerrit.wikimedia.org/r/1187048
[17:41:47] <wikibugs>	 (Abandoned) Bking: WIP: wdqs: Add alerts for no lag metrics reported [alerts] - https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: Bking)
[17:45:15] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm
[17:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1233-1236].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[17:47:11] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1256.eqiad.wmnet
[17:47:32] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1256* gradually with 4 steps - Work done
[17:47:34] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[17:48:40] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:48:58] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:49:02] <logmsgbot>	 btullis@cumin1003 decommission (PID 1983368) is awaiting input
[17:49:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83178 and previous config saved to /var/cache/conftool/dbconfig/20250910-174903-ladsgroup.json
[17:49:08] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[17:49:25] <wikibugs>	 (Merged) jenkins-bot: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[17:49:31] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:49:50] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:51:38] <wikibugs>	 (PS1) Jdlrobson: Enable search recommendation on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048)
[17:52:24] <wikibugs>	 (CR) CI reject: [V:-1] Enable search recommendation on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) (owner: Jdlrobson)
[17:52:56] <wikibugs>	 (PS1) Scott French: proton: persist increased replica count [deployment-charts] - https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131)
[17:53:11] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1233-1236].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[17:53:11] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:53:12] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1233-1236].eqiad.wmnet
[17:53:58] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: sync
[17:54:06] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync
[17:56:19] <wikibugs>	 (CR) Scott French: [C:+2] proton: persist increased replica count [deployment-charts] - https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131) (owner: Scott French)
[17:57:58] <wikibugs>	 (Merged) jenkins-bot: proton: persist increased replica count [deployment-charts] - https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131) (owner: Scott French)
[18:00:05] <jouncebot>	 dduvall and dancy: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1800)
[18:00:19] <dancy>	 o/
[18:00:23] <dancy>	 Loitering.
[18:01:04] <wikibugs>	 (PS1) Dzahn: zuul::executor: systctl setting unprivileged_userns_clone needed [puppet] - https://gerrit.wikimedia.org/r/1187055 (https://phabricator.wikimedia.org/T403847)
[18:01:59] <dduvall>	 dancy: o/ rolling in a sec
[18:02:38] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.45.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379)
[18:02:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379) (owner: TrainBranchBot)
[18:02:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[18:03:39] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.45.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379) (owner: TrainBranchBot)
[18:04:06] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:04:11] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P83180 and previous config saved to /var/cache/conftool/dbconfig/20250910-180411-ladsgroup.json
[18:06:14] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[18:08:24] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[18:09:36] <wikibugs>	 (CR) Scott French: [C:+2] rest-gateway: Introduce rest-gateway-ro [puppet] - https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) (owner: Clément Goubert)
[18:10:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moving 4 servers to the analytics vlan - btullis@cumin1003"
[18:10:11] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moving 4 servers to the analytics vlan - btullis@cumin1003"
[18:10:11] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:10:58] <wikibugs>	 SRE, bacula, Data-Persistence-Backup, Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11169125 (Dzahn) @jcrespo I tested a restore on people1005. Just selected 3 image files from my own home dir...
[18:11:51] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1233
[18:13:15] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1233
[18:15:47] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.18  refs T396379
[18:15:52] <stashbot>	 T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379
[18:16:02] <swfrench-wmf>	 !log running puppet agent on A:dnsbox hosts - T400131
[18:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:06] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[18:16:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:16:32] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:18:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1234
[18:19:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P83182 and previous config saved to /var/cache/conftool/dbconfig/20250910-181918-ladsgroup.json
[18:19:46] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1234
[18:19:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1235
[18:20:07] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1235
[18:20:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1236
[18:20:23] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1236
[18:21:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:25:17] <wikibugs>	 (PS2) Jdlrobson: Enable search recommendation on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048)
[18:26:18] <wikibugs>	 (CR) CI reject: [V:-1] Enable search recommendation on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) (owner: Jdlrobson)
[18:26:37] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm
[18:27:53] <wikibugs>	 (PS3) Jdlrobson: Enable search recommendation on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048)
[18:28:11] <dduvall>	 !log elevated error rate during wmf.18 group1 promotion. all were `$aspect must use one of the XXX_USAGE constants` error occurring from wmf.17 (cc T404238)
[18:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:16] <stashbot>	 T404238: InvalidArgumentException: $aspect must use one of the XXX_USAGE constants, "A" given! - https://phabricator.wikimedia.org/T404238
[18:28:52] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=eqiad [reason: Pooling eqiad on new -ro service - T400131]
[18:28:56] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[18:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:29:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:30:14] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM thanks!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[18:33:10] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1256* gradually with 4 steps - Work done
[18:33:31] <wikibugs>	 (PS1) Bartosz Dziewoński: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519)
[18:33:39] <wikibugs>	 (PS1) Bartosz Dziewoński: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519)
[18:33:46] <wikibugs>	 (CR) Scott French: [C:+2] wmnet: Introduce rest-gateway-ro [dns] - https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: Clément Goubert)
[18:33:59] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński)
[18:34:14] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński)
[18:34:26] <logmsgbot>	 !log swfrench@dns1004 START - running authdns-update
[18:34:26] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83184 and previous config saved to /var/cache/conftool/dbconfig/20250910-183426-ladsgroup.json
[18:34:32] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[18:34:42] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2210.codfw.wmnet with reason: Maintenance
[18:34:49] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83185 and previous config saved to /var/cache/conftool/dbconfig/20250910-183449-ladsgroup.json
[18:35:36] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11169332 (vaughnwalters)
[18:35:47] <logmsgbot>	 !log swfrench@dns1004 END - running authdns-update
[18:35:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11169333 (cmooney) >>! In T404146#11166358, @ayounsi wrote: > We already (and lazily) do :  `min_value=0, max_value=48` which was to accommodate both SON...
[18:36:28] <wikibugs>	 (PS1) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324)
[18:36:56] <wikibugs>	 (PS1) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324)
[18:37:04] <wikibugs>	 (PS2) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324)
[18:37:06] <wikibugs>	 (CR) CI reject: [V:-1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:37:10] <wikibugs>	 (PS2) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324)
[18:37:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:37:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:38:40] <wikibugs>	 (CR) Lucas Werkmeister: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:39:01] <swfrench-wmf>	 !log ran authdns-update to add rest-gateway-ro and point rest-gateway at it - T400131
[18:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:04] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[18:40:33] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324)
[18:40:39] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324)
[18:40:42] <wikibugs>	 (PS4) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324)
[18:41:43] <wikibugs>	 (CR) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:42:54] <wikibugs>	 (CR) Lucas Werkmeister: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:42:58] <wikibugs>	 (CR) Lucas Werkmeister: [C:+1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:43:02] <wikibugs>	 (CR) Lucas Werkmeister: [C:+1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński)
[18:43:34] <swfrench-wmf>	 !log temporarily disabling puppet agent on A:dnsbox hosts - T400131
[18:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:12] <wikibugs>	 (CR) Scott French: [C:+2] rest-gateway: Switch rest-gateway to A/P [puppet] - https://gerrit.wikimedia.org/r/1183084 (https://phabricator.wikimedia.org/T400131) (owner: Clément Goubert)
[18:44:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11169380 (Jclark-ctr) @bking If you could update the Partman for EFI booting — it was originally set up for Legacy. I had requested the change to EFI booting, but it was failing, possibly due...
[18:50:57] <swfrench-wmf>	 !log running puppet agent on A:dnsbox hosts - T400131
[18:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:02] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[18:51:03] <sukhe>	 \m/
[18:51:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:56:12] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83186 and previous config saved to /var/cache/conftool/dbconfig/20250910-185611-ladsgroup.json
[18:56:16] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[18:58:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Renumber link addressing to lsw1-e1-eqiad and lsw1-f1-eqiad - https://phabricator.wikimedia.org/T404248 (cmooney) NEW p:Triage→Low
[19:01:56] <wikibugs>	 (PS1) Andrew Bogott: eqiad cloudceph: move one osd and one mon node to version 'reef' [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249)
[19:02:59] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott)
[19:03:35] <wikibugs>	 SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11169471 (Andrew) Open→Resolved Everything is now on Quincy + Reef.
[19:06:34] <wikibugs>	 (CR) Andrew Bogott: [C:+2] eqiad cloudceph: move one osd and one mon node to version 'reef' [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott)
[19:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:10:49] <wikibugs>	 (PS1) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847)
[19:11:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P83187 and previous config saved to /var/cache/conftool/dbconfig/20250910-191119-ladsgroup.json
[19:11:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bookworm
[19:12:23] <wikibugs>	 (PS2) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847)
[19:13:42] <wikibugs>	 (PS3) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847)
[19:13:47] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn)
[19:15:41] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251)
[19:16:52] <wikibugs>	 (PS1) CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695)
[19:17:34] <wikibugs>	 (PS2) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251)
[19:20:04] <musikanimal>	 https://spiderpig.wikimedia.org/jobs/541 <-- with the train promoted to group1, am I safe to deploy something right now?
[19:21:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:22:31] <wikibugs>	 (CR) Dzahn: admin: add mahmoud-abdelsattar to ldap_only_users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: CDobbins)
[19:24:57] <wikibugs>	 (CR) Dzahn: [C:+2] "fine on executor nodes but needs follow-up on main nodes" [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn)
[19:25:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse dns zones added for ssw1-d1-eqiad links - cmooney@cumin1003"
[19:25:08] <wikibugs>	 (PS1) Cathal Mooney: Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588)
[19:26:27] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P83188 and previous config saved to /var/cache/conftool/dbconfig/20250910-192626-ladsgroup.json
[19:27:47] <Reedy>	 jouncebot: nowandnext
[19:27:47] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1800)
[19:27:47] <jouncebot>	 In 0 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000)
[19:28:11] <logmsgbot>	 cmooney@cumin1003 netbox (PID 1994354) is awaiting input
[19:28:42] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse dns zones added for ssw1-d1-eqiad links - cmooney@cumin1003"
[19:28:42] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:30:20] <wikibugs>	 (PS1) Dzahn: zuul::main: need to set TLS cert path variables before using them [puppet] - https://gerrit.wikimedia.org/r/1187083 (https://phabricator.wikimedia.org/T401614)
[19:30:45] <wikibugs>	 (CR) Ssingh: [C:+1] Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588) (owner: Cathal Mooney)
[19:31:07] <wikibugs>	 (CR) Dzahn: [C:+2] zuul::main: need to set TLS cert path variables before using them [puppet] - https://gerrit.wikimedia.org/r/1187083 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn)
[19:31:36] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588) (owner: Cathal Mooney)
[19:32:07] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[19:33:24] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender)
[19:33:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender)
[19:33:34] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[19:41:38] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83189 and previous config saved to /var/cache/conftool/dbconfig/20250910-194134-ladsgroup.json
[19:41:43] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[19:41:53] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2219.codfw.wmnet with reason: Maintenance
[19:41:56] <wikibugs>	 (03PS1) 10Reedy: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252)
[19:42:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83190 and previous config saved to /var/cache/conftool/dbconfig/20250910-194200-ladsgroup.json
[19:42:03] <wikibugs>	 (03CR) 10Reedy: [C:03+2] HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252) (owner: 10Reedy)
[19:42:10] <wikibugs>	 (03PS1) 10Reedy: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252)
[19:42:16] <wikibugs>	 (03CR) 10Reedy: [C:03+2] HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252) (owner: 10Reedy)
[19:42:20] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage
[19:44:15] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:44:15] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[19:44:32] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:48:56] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11169688 (10JAllemandou) Building a JAR containing the `is_pageview` logic with a few dependencies as possible is easy. My little researc...
[19:49:15] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage
[19:50:15] <logmsgbot>	 cmooney@cumin1003 netbox (PID 1996852) is awaiting input
[19:51:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for new IPs for ssw1-d1-eqiad - cmooney@cumin1003"
[19:52:07] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for new IPs for ssw1-d1-eqiad - cmooney@cumin1003"
[19:52:07] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:53:53] <wikibugs>	 (03Merged) 10jenkins-bot: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252) (owner: 10Reedy)
[19:55:00] <wikibugs>	 (03PS4) 10NMW03: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428)
[19:55:39] <Nemoralis>	 jouncebot: next
[19:55:39] <jouncebot>	 In 0 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000).
[20:00:05] <jouncebot>	 Nemoralis and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] <Nemoralis>	 o/
[20:00:25] <MatmaRex>	 hi
[20:00:29] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks for highlighting the source of the policy change adding the port." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus)
[20:00:32] <wikibugs>	 (03Merged) 10jenkins-bot: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252) (owner: 10Reedy)
[20:01:04] <Reedy>	 Who is doing the deployment? Or have I put myself in the position that it's me? :P
[20:01:25] <Nemoralis>	 not me of course ;P
[20:01:37] <MatmaRex>	 heh. looks like it's you Reedy, thanks ;)
[20:01:40] <wikibugs>	 (03CR) 10Reedy: [C:03+2] ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński)
[20:01:41] <wikibugs>	 (03CR) 10Reedy: [C:03+2] ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński)
[20:02:08] <wikibugs>	 (03CR) 10Reedy: [C:03+2] build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: 10Umherirrender)
[20:02:36] <Reedy>	 MatmaRex: for your session handling changes, do you need to test those and/or is there a reasonable chance of needing to revert? :P
[20:02:53] <Reedy>	 I think I'll probably do the rest in one go, and then do those two after...
[20:03:19] <wikibugs>	 (03Merged) 10jenkins-bot: build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: 10Umherirrender)
[20:03:36] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03)
[20:04:11] <MatmaRex>	 Reedy: the backports are not risky. the config changes depend on the backport and are a little bit more risky. i can test once the config changes are on mwdebug.
[20:04:41] <Reedy>	 Do you want those config ones separately? Or do the lot in one go?
[20:04:45] <Reedy>	 separately/after
[20:05:37] <wikibugs>	 (03Merged) 10jenkins-bot: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03)
[20:05:41] <MatmaRex>	 Reedy: if you can do the backports and session config at the same time, that would be fine. we should probably do Nemoralis's config and the codesniffer one separately first
[20:06:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83192 and previous config saved to /var/cache/conftool/dbconfig/20250910-200612-ladsgroup.json
[20:06:18] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[20:07:09] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]]
[20:07:18] <stashbot>	 T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252
[20:07:18] <stashbot>	 T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428
[20:07:19] <stashbot>	 T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781
[20:08:26] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bookworm
[20:10:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1016.eqiad.wmnet with OS bookworm
[20:11:11] <wikibugs>	 (03CR) 10Scott French: [C:03+1] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[20:12:33] <wikibugs>	 (03PS2) 10CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695)
[20:13:32] <logmsgbot>	 !log reedy@deploy1003 reedy, umherirrender, nmw03: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:13:39] <Reedy>	 Nemoralis: ^ Do you want to test? I'm fine if you don't want to
[20:13:39] <stashbot>	 T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252
[20:13:39] <stashbot>	 T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428
[20:13:40] <stashbot>	 T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781
[20:14:16] <Nemoralis>	 Reedy: tested, LGTM
[20:14:39] <logmsgbot>	 !log reedy@deploy1003 reedy, umherirrender, nmw03: Continuing with sync
[20:15:02] <wikibugs>	 (03PS1) 10Cathal Mooney: Nokia: EBGP configuration base build [homer/public] - 10https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577)
[20:16:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Nokia: EBGP configuration base build [homer/public] - 10https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[20:16:34] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] "Thanks! Verified that none of these charts have been bumped at master in the meantime, so this is conflict-free." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus)
[20:16:39] <wikibugs>	 (03CR) 10CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: 10CDobbins)
[20:17:15] <wikibugs>	 (03Merged) 10jenkins-bot: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński)
[20:17:50] <wikibugs>	 (03PS2) 10Cathal Mooney: Nokia: EBGP configuration base build [homer/public] - 10https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577)
[20:19:02] <wikibugs>	 (03Merged) 10jenkins-bot: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński)
[20:19:58] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]] (duration: 12m 48s)
[20:19:58] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[20:19:59] <wikibugs>	 (03CR) 10Scott French: [C:03+2] wmnet: Switch rest-gateway to metafo [dns] - 10https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert)
[20:20:00] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[20:20:05] <stashbot>	 T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252
[20:20:05] <stashbot>	 T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428
[20:20:06] <stashbot>	 T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781
[20:20:55] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[20:20:55] <wikibugs>	 (03PS2) 10Clément Goubert: wmnet: Switch rest-gateway to metafo [dns] - 10https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412)
[20:20:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[20:20:59] <lucaswerkmeister>	 (I’m just here so I can shout complaints if the session changes break my tools again ;P)
[20:21:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P83193 and previous config saved to /var/cache/conftool/dbconfig/20250910-202119-ladsgroup.json
[20:21:28] <MatmaRex>	 hi lucaswerkmeister, thanks :)
[20:21:45] <wikibugs>	 (03CR) 10Scott French: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert)
[20:23:15] <wikibugs>	 (03CR) 10Cathal Mooney: Nokia: EBGP configuration base build (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[20:23:30] <logmsgbot>	 !log swfrench@dns1004 START - running authdns-update
[20:24:06] <Reedy>	 dancy: About?
[20:24:15] <dancy>	 Hola
[20:24:23] <Reedy>	 scap is doing something weird
[20:24:28] <Reedy>	 and macos terminal just crashed again
[20:24:45] <logmsgbot>	 !log swfrench@dns1004 END - running authdns-update
[20:25:01] <Reedy>	 reedy@deploy1003:/srv/mediawiki-staging$ scap backport 1187062 1187063 1187066 1187065
[20:25:01] <Reedy>	 20:23:37 Checking whether requested changes are in a branch deployed to production and their dependencies valid...
[20:25:01] <Reedy>	 20:23:41 Change '1187062' validated for backport
[20:25:01] <Reedy>	 20:23:44 Change '1187063' validated for backport
[20:25:01] <Reedy>	 Change '1186595', project 'mediawiki/core', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.45.0-wmf.17', '1.45.0-wmf.18']
[20:25:02] <Reedy>	 Continue with backport? [y/N]: 
[20:25:08] <Reedy>	 Why is it trying to do 1186595?
[20:25:14] <swfrench-wmf>	 !log ran authdns-update to convert rest-gateway to active/passive - T400131
[20:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:17] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[20:25:19] <MatmaRex>	 hmm
[20:25:19] <Reedy>	 Depends-On?
[20:25:26] <dancy>	 It must be a Depends-On in one of the primary commits
[20:25:29] <MatmaRex>	 probably because lucaswerkmeister made me add Depends-On ;)
[20:25:40] <Reedy>	 Now we can't deploy because scap says no :P
[20:25:48] <dancy>	 You can answer yes
[20:26:04] <Reedy>	 Do you know if there's a bug about this, or shall I file one?
[20:26:11] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=eqiad [reason: Pooling codfw on new -ro service - T400131]
[20:26:13] <dancy>	 I believe there is one. I'll look it up
[20:26:15] <Reedy>	 As all three commits with the change-id are merged, so it shouldn't really care
[20:26:16] <Reedy>	 thanks
[20:26:32] <lucaswerkmeister>	 the bug would presumably be T388025
[20:26:32] <stashbot>	 T388025: scap complaining about dependency which is already merged - https://phabricator.wikimedia.org/T388025
[20:26:36] <lucaswerkmeister>	 cc dancy
[20:26:44] <logmsgbot>	 !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=codfw [reason: Pooling codfw on new -ro service - T400131]
[20:26:45] <dancy>	 That's the one.
[20:26:54] <lucaswerkmeister>	 I had only seen the opposite of it, T397931, which is why I asked to add the Depends-On
[20:26:55] <stashbot>	 T397931: scap not complaining about dependencies only partially deployed with the train - https://phabricator.wikimedia.org/T397931
[20:26:58] <lucaswerkmeister>	 good to know it’s broken both ways
[20:26:58] <MatmaRex>	 yeah, the changes to deploy are correct
[20:27:00] <wikibugs>	 (03Merged) 10jenkins-bot: all charts: Update mesh.configuration 1.13.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus)
[20:27:06] <Reedy>	 heh
[20:27:24] <Reedy>	 dippy bird to press Y
[20:27:35] <MatmaRex>	 well, good to know for sure that we shouldn't use Depends-On in operations/mediawiki-config
[20:27:38] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]]
[20:27:43] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[20:27:44] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[20:29:17] <wikibugs>	 (03PS1) 10CDobbins: admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080)
[20:32:07] <logmsgbot>	 !log reedy@deploy1003 reedy, matmarex: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:32:20] <Reedy>	 MatmaRex: lucaswerkmeister ^ have at it
[20:33:47] <MatmaRex>	 oh, i just realized i can't make an oauth app use the test servers
[20:34:06] <MatmaRex>	 so… i guess i'll test once it's live? i was using https://oauth-hello-world.toolforge.org/
[20:34:30] <Reedy>	 heh, want me to just ship it then?
[20:35:10] <MatmaRex>	 normal logins and edits work fine
[20:35:12] <MatmaRex>	 yeahhh
[20:36:08] <lucaswerkmeister>	 I mean, you could probably make it send the right X-Wikimedia-Debug headers somewhere in the innards of the code
[20:36:11] <lucaswerkmeister>	 but yeah just syncing sounds okay too
[20:36:27] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P83194 and previous config saved to /var/cache/conftool/dbconfig/20250910-203626-ladsgroup.json
[20:36:34] <logmsgbot>	 !log reedy@deploy1003 reedy, matmarex: Continuing with sync
[20:36:35] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage
[20:40:07] <wikibugs>	 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11169880 (10Catrope)
[20:40:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage
[20:41:47] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]] (duration: 14m 09s)
[20:41:53] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[20:41:53] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[20:42:39] <MatmaRex>	 it works now :)
[20:42:42] <lucaswerkmeister>	 looks like edits are still working \o/ https://test.wikidata.org/w/index.php?title=Lexeme:L123&diff=prev&oldid=738333
[20:43:05] <MatmaRex>	 https://test.wikipedia.org/w/index.php?title=User_talk:Matma_Rex&diff=prev&oldid=673772
[20:43:11] <MatmaRex>	 thanks Reedy
[20:51:35] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83195 and previous config saved to /var/cache/conftool/dbconfig/20250910-205134-ladsgroup.json
[20:51:39] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2236.codfw.wmnet with reason: Maintenance
[20:51:40] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[20:51:47] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83196 and previous config saved to /var/cache/conftool/dbconfig/20250910-205146-ladsgroup.json
[20:52:48] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (owner: 10Gergő Tisza)
[20:52:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (owner: 10Gergő Tisza)
[20:53:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[20:53:30] <jynus>	 ^ expected due to news
[20:54:03] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza)
[20:54:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[20:56:11] <perryprog>	 likely due to ne—frick
[20:57:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1016.eqiad.wmnet with OS bookworm
[20:58:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[20:59:07] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2100)
[21:06:16] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[21:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[21:09:27] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/apertium: apply
[21:09:36] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/apertium: apply
[21:10:23] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/apertium: apply
[21:11:01] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[21:11:15] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/apertium: apply
[21:11:56] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[21:15:09] <wikibugs>	 (03PS1) 10SBassett: Optionally encrypt OTP secret in the database [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915)
[21:16:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915) (owner: 10SBassett)
[21:16:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:17:00] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83197 and previous config saved to /var/cache/conftool/dbconfig/20250910-211659-ladsgroup.json
[21:17:04] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[21:17:21] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[21:18:17] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[21:18:33] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[21:18:35] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[21:24:57] <wikibugs>	 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11170002 (10RLazarus)
[21:25:13] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:28:11] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
[21:28:28] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[21:30:16] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[21:30:17] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[21:30:25] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Ah, that's an interesting use case, and agreed that the aggregated TTFB is meaningful in that context. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1187048 (owner: 10CDanis)
[21:32:06] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[21:32:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P83198 and previous config saved to /var/cache/conftool/dbconfig/20250910-213207-ladsgroup.json
[21:33:58] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:35:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:39:03] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Start AB test of did-you-mean profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858)
[21:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:52] <wikibugs>	 (PS1) Bking: opensearch-operator: point to correct image [deployment-charts] - https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246)
[21:47:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P83199 and previous config saved to /var/cache/conftool/dbconfig/20250910-214714-ladsgroup.json
[21:48:58] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:53:47] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11170105 (Jdlrobson-WMF)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2200)
[22:01:18] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd1016.eqiad.wmnet: update network names for Boookworm [puppet] - https://gerrit.wikimedia.org/r/1187111
[22:01:54] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd1016.eqiad.wmnet: update network names for Boookworm [puppet] - https://gerrit.wikimedia.org/r/1187111 (owner: Andrew Bogott)
[22:02:22] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83200 and previous config saved to /var/cache/conftool/dbconfig/20250910-220222-ladsgroup.json
[22:02:27] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[22:02:38] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2237.codfw.wmnet with reason: Maintenance
[22:02:46] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83201 and previous config saved to /var/cache/conftool/dbconfig/20250910-220245-ladsgroup.json
[22:04:06] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:16:32] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:18:42] <wikibugs>	 (CR) Dzahn: [C:+1] admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: CDobbins)
[22:26:42] <wikibugs>	 (CR) Dzahn: [C:+1] "https://puppet-compiler.wmflabs.org/output/1187020/6884/people1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: Muehlenhoff)
[22:27:39] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83202 and previous config saved to /var/cache/conftool/dbconfig/20250910-222738-ladsgroup.json
[22:27:43] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[22:28:58] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:29:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:46] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P83203 and previous config saved to /var/cache/conftool/dbconfig/20250910-224246-ladsgroup.json
[22:51:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:55:48] <wikibugs>	 (PS1) RLazarus: envoyproxy: Remove lua_script param [puppet] - https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036)
[22:57:54] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P83204 and previous config saved to /var/cache/conftool/dbconfig/20250910-225753-ladsgroup.json
[23:01:01] <wikibugs>	 (CR) RLazarus: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6885/console" [puppet] - https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[23:01:43] <wikibugs>	 (CR) RLazarus: envoyproxy: Remove lua_script param [puppet] - https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[23:08:24] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Reboot', diff saved to https://phabricator.wikimedia.org/P83205 and previous config saved to /var/cache/conftool/dbconfig/20250910-230823-ladsgroup.json
[23:08:35] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1257.eqiad.wmnet
[23:08:54] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1257 - Upgrading db1257.eqiad.wmnet
[23:08:58] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:09:01] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1257 - Upgrading db1257.eqiad.wmnet
[23:13:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83206 and previous config saved to /var/cache/conftool/dbconfig/20250910-231301-ladsgroup.json
[23:13:06] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[23:13:17] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2239.codfw.wmnet with reason: Maintenance
[23:14:00] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1257.eqiad.wmnet
[23:15:03] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1257* gradually with 4 steps - Work done
[23:15:07] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:16:35] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:18:19] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1187131 (https://phabricator.wikimedia.org/T404274)
[23:18:24] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1187132 (https://phabricator.wikimedia.org/T404274)
[23:25:51] <rzl>	 !log sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.29.12-1_amd64.deb  # T403663
[23:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:56] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[23:26:05] <rzl>	 !log sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy  # T403663
[23:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:12] <rzl>	 !log sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy  # T403663
[23:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:32] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking), Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275 (Seddon) NEW
[23:28:44] <wikibugs>	 (PS1) RLazarus: envoy: Update to v1.29.12 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663)
[23:28:48] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1193 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1187135 (https://phabricator.wikimedia.org/T404277)
[23:28:53] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1187136 (https://phabricator.wikimedia.org/T404277)
[23:28:58] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): [QA Task] Verify Android app compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404276 (Seddon) NEW
[23:31:05] <wikibugs>	 (CR) RLazarus: [V:+2] "`" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[23:31:31] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:33:36] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T404277
[23:33:40] <stashbot>	 T404277: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277
[23:34:29] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1193 with weight 0 T404277', diff saved to https://phabricator.wikimedia.org/P83209 and previous config saved to /var/cache/conftool/dbconfig/20250910-233428-ladsgroup.json
[23:37:59] <wikibugs>	 (CR) Ladsgroup: [C:+2] mariadb: Promote db1193 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1187135 (https://phabricator.wikimedia.org/T404277) (owner: Gerrit maintenance bot)
[23:38:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187141
[23:38:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187141 (owner: TrainBranchBot)
[23:38:48] <Amir1>	 !log Starting s8 eqiad failover from db1209 to db1193 - T404277
[23:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:52] <stashbot>	 T404277: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277
[23:39:00] <wikibugs>	 (CR) Scott French: [C:+1] envoy: Update to v1.29.12 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[23:39:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T404277', diff saved to https://phabricator.wikimedia.org/P83210 and previous config saved to /var/cache/conftool/dbconfig/20250910-233902-ladsgroup.json
[23:39:36] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2240.codfw.wmnet with reason: Maintenance
[23:39:44] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T402763)', diff saved to https://phabricator.wikimedia.org/P83211 and previous config saved to /var/cache/conftool/dbconfig/20250910-233943-ladsgroup.json
[23:39:50] <stashbot>	 T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[23:40:03] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.011 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:40:50] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1193 to s8 primary and set section read-write T404277', diff saved to https://phabricator.wikimedia.org/P83212 and previous config saved to /var/cache/conftool/dbconfig/20250910-234049-ladsgroup.json
[23:41:11] <wikibugs>	 (CR) RLazarus: [V:+2 C:+2] envoy: Update to v1.29.12 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[23:42:57] <wikibugs>	 (CR) Ladsgroup: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1187136 (https://phabricator.wikimedia.org/T404277) (owner: Gerrit maintenance bot)
[23:43:14] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[23:44:24] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[23:44:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1209 T404277', diff saved to https://phabricator.wikimedia.org/P83213 and previous config saved to /var/cache/conftool/dbconfig/20250910-234456-ladsgroup.json
[23:45:00] <stashbot>	 T404277: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277
[23:46:35] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:47:03] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1209.eqiad.wmnet
[23:47:11] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1209 - Upgrading db1209.eqiad.wmnet
[23:47:18] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1209 - Upgrading db1209.eqiad.wmnet
[23:53:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187141 (owner: TrainBranchBot)
[23:55:09] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:58:00] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1209.eqiad.wmnet
[23:58:53] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1209* gradually with 4 steps - Work done