[00:00:34] (Merged) jenkins-bot: all charts: Update mesh.configuration 1.14.0 to 1.14.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1186640 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [00:04:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P83080 and previous config saved to /var/cache/conftool/dbconfig/20250910-000807-fceratto.json [00:08:23] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645 [00:08:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645 (owner: TrainBranchBot) [00:11:17] FIRING: ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:11:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T402763)', diff saved to https://phabricator.wikimedia.org/P83081 and previous config saved to /var/cache/conftool/dbconfig/20250910-001131-fceratto.json [00:11:36] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:11:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: Maintenance [00:13:34] !log rzl@deploy1003 helmfile 
[staging] START helmfile.d/services/wikifunctions: apply [00:14:11] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:15:14] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [00:16:17] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:26] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [00:16:56] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [00:18:02] (CR) RLazarus: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [00:18:05] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [00:21:17] RESOLVED: [4x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:17] FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T402763)', diff saved to https://phabricator.wikimedia.org/P83082 and previous config saved to /var/cache/conftool/dbconfig/20250910-002315-fceratto.json [00:23:19] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:23:31] 
!log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: Maintenance [00:23:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83083 and previous config saved to /var/cache/conftool/dbconfig/20250910-002338-fceratto.json [00:27:17] RESOLVED: [3x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:29:32] (PS1) Andrew Bogott: wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647 [00:29:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83084 and previous config saved to /var/cache/conftool/dbconfig/20250910-002943-fceratto.json [00:29:48] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:31:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1186645 (owner: TrainBranchBot) [00:31:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott) [00:35:18] (PS2) Andrew Bogott: wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647 [00:39:02] (PS1) Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649 [00:39:16] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott) [00:39:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: 
Andrew Bogott) [00:39:38] (CR) CI reject: [V:-1] Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: Andrew Bogott) [00:41:05] (CR) RLazarus: "The gargantuan CI diff is correct, after some semiautomated review: it's all chart patch-version bumps, associated checksums, and in some " [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [00:42:50] (PS2) Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - https://gerrit.wikimedia.org/r/1186649 [00:43:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186649 (owner: Andrew Bogott) [00:44:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P83085 and previous config saved to /var/cache/conftool/dbconfig/20250910-004451-fceratto.json [00:45:17] FIRING: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:45:57] (PS1) Sbisson: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) [00:46:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: Sbisson) [00:48:08] (PS1) Sbisson: Desktop publish_success: add revid and pageid [extensions/ContentTranslation] 
(wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) [00:48:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: Sbisson) [00:50:17] RESOLVED: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:51:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:59:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P83086 and previous config saved to /var/cache/conftool/dbconfig/20250910-005958-fceratto.json [01:00:49] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:06:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:07:49] (CR) Andrew Bogott: [C:+2] wmcs radosgw: use 'beast' http server rather than civetweb [puppet] - https://gerrit.wikimedia.org/r/1186647 (owner: Andrew Bogott) [01:11:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:58] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 08s) [01:15:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T402763)', diff saved to https://phabricator.wikimedia.org/P83087 and previous config saved to /var/cache/conftool/dbconfig/20250910-011506-fceratto.json [01:15:11] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [01:15:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: Maintenance [01:16:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:35] FIRING: DiskSpace: Disk space deploy1003:9100:/ 3.975% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:19:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2203.codfw.wmnet with reason: Maintenance 
[01:20:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83088 and previous config saved to /var/cache/conftool/dbconfig/20250910-012006-fceratto.json [01:25:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83089 and previous config saved to /var/cache/conftool/dbconfig/20250910-012533-fceratto.json [01:25:38] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [01:26:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:36:17] FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:41:17] FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:11:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: Maintenance [02:11:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83090 and previous config saved to /var/cache/conftool/dbconfig/20250910-021116-fceratto.json [02:11:21] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [02:17:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83091 and previous config saved to /var/cache/conftool/dbconfig/20250910-021720-fceratto.json [02:17:25] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [02:26:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes 
(http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:32:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P83092 and previous config saved to /var/cache/conftool/dbconfig/20250910-023228-fceratto.json [02:36:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:39:23] (PS1) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T402584) [02:42:26] (PS2) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) [02:46:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:47:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P83093 and previous config saved to /var/cache/conftool/dbconfig/20250910-024735-fceratto.json [02:56:17] FIRING: [7x] ProbeDown: Service 
wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:02:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T402763)', diff saved to https://phabricator.wikimedia.org/P83094 and previous config saved to /var/cache/conftool/dbconfig/20250910-030243-fceratto.json [03:02:48] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [03:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:56:20] (PS1) Papaul: Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845) [04:26:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:36:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:50] FIRING: DiskSpace: Disk space deploy1003:9100:/ 3.39% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:40:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:45:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:45:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:46:49] !log rebalance ganeti03 in esams T402259 [05:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:53] T402259: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259 [05:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:51:30] (CR) Muehlenhoff: [C:+1] "LGTM, comment inline" [puppet] - https://gerrit.wikimedia.org/r/1186609 (owner: Dzahn) [06:00:05] Deploy window MediaWiki infrastructure (UTC 
early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T0600) [06:04:54] !log installing node-minipass security updates [06:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:11:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:16:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:40:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:40:46] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [06:40:51] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:44:32] (Restored) Thiemo Kreuz (WMDE): Drop deprecated survey prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: Awight) [06:45:48] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 35.64 ms [06:50:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:50:50] (CR) Elukey: [C:+1] maps: Move the setting for planet_sync_hours to the common role setting [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [06:50:51] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:06:06] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: sync [07:06:21] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: sync [07:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:11:27] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: sync [07:11:41] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync [07:11:51] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: sync [07:12:06] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: sync [07:12:46] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: sync [07:13:01] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: sync [07:13:24] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/image-suggestion: sync [07:13:35] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync [07:13:55] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: sync [07:14:10] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: sync [07:14:32] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: sync [07:14:47] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: sync [07:19:40] SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11166132 (elukey) Ack! Upgraded staging, and pinged the DSE SREs as well on slack to gather their opinion about ownership etc.. 
[07:30:06] (CR) Ayounsi: [C:+1] Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845) (owner: Papaul) [07:31:21] (PS2) Bartosz Wójtowicz: statistics: Update model upload script to check for correct boto3 version. [puppet] - https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) [07:35:18] (CR) Elukey: [C:+2] "nice work!" [puppet] - https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz) [07:41:24] (CR) Muehlenhoff: [C:+2] maps: Remove disable_tile_generation_timer [puppet] - https://gerrit.wikimedia.org/r/1186523 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [07:41:29] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet [07:41:30] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet [07:41:44] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1009.eqiad.wmnet [07:44:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [07:44:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [07:44:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [07:45:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [07:46:49] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1009.eqiad.wmnet [07:47:12] (PS2) Muehlenhoff: maps: Move the setting for planet_sync_hours to the common role setting [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) [07:48:22] !log 
elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [07:49:10] !log upgrading Envoy on chartmuseum* T402584 [07:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:14] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [07:50:18] !log upgraded envoy on dse-k8s-eqiad/dataset-config(-next) - T402584 [07:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:04] PROBLEM - Host ml-serve1009 is DOWN: PING CRITICAL - Packet loss = 100% [07:51:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [07:53:32] RECOVERY - Host ml-serve1009 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [07:53:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [07:54:12] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet [07:54:13] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet [07:54:27] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1010.eqiad.wmnet [07:55:53] (PS1) Muehlenhoff: Update Ganeti alias for esams [puppet] - https://gerrit.wikimedia.org/r/1186925 (https://phabricator.wikimedia.org/T402259) [07:56:27] (CR) Ayounsi: "overall lgtm, some comments" [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [07:59:43] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1010.eqiad.wmnet [08:00:10] (CR) Muehlenhoff: [C:+2] Update Ganeti alias for esams [puppet] - 
10https://gerrit.wikimedia.org/r/1186925 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [08:00:41] (03CR) 10Muehlenhoff: [C:03+2] maps: Move the setting for planet_sync_hours to the common role setting [puppet] - 10https://gerrit.wikimedia.org/r/1186533 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:00:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:02:42] PROBLEM - Host ml-serve1010 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:12] RECOVERY - Host ml-serve1010 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [08:05:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:06:43] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1010.eqiad.wmnet [08:06:43] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1010.eqiad.wmnet [08:06:59] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1011.eqiad.wmnet [08:12:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1011.eqiad.wmnet [08:12:50] (03PS1) 10JMeybohm: Code changes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1186929 [08:13:48] (03PS1) 10Brouberol: runner: redact mysql passwors from the command string when reporting an error [dumps] - 10https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162) [08:14:16] (03CR) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:14:25] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1011.mgmt.eqiad.wmnet with 
chassis set policy GRACEFUL_RESTART [08:14:35] (03PS2) 10JMeybohm: Fix rename success handling, BackendHandler additions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1186929 [08:14:54] (03CR) 10JMeybohm: [V:03+2 C:03+2] Fix rename success handling, BackendHandler additions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1186929 (owner: 10JMeybohm) [08:15:12] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11166246 (10elukey) [08:15:21] !log jayme@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [08:15:22] !log jayme@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [08:16:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [08:16:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [08:16:54] PROBLEM - Host ml-serve1011 is DOWN: PING CRITICAL - Packet loss = 100% [08:17:18] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11166249 (10elukey) 05Open→03Resolved All hosts done, and the provision cookbook now supports them. The supermicros for ML have a special firmware (Legacy/UEFI) that don... 
[08:18:33] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11166252 (10elukey) [08:19:12] (03PS2) 10Brouberol: runner: redact mysql password from the command string when reporting an error [dumps] - 10https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162) [08:19:22] RECOVERY - Host ml-serve1011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [08:19:27] (03CR) 10Ayounsi: EBGP Config: Move all ASN definitions to 'asns_mapping' (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:19:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:19:52] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1011.eqiad.wmnet [08:19:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1011.eqiad.wmnet [08:41:23] (03PS3) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [08:42:03] (03CR) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:42:47] (03CR) 10CI reject: [V:04-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:44:32] (03PS1) 10Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) [08:44:32] (03PS4) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 
'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [08:45:00] (03CR) 10CI reject: [V:04-1] Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:45:58] (03CR) 10CI reject: [V:04-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:46:42] (03PS5) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [08:48:32] (03CR) 10CI reject: [V:04-1] EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:51:15] (03PS6) 10Cathal Mooney: EBGP Config: Move all ASN definitions to 'asns_mapping' [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [08:57:14] (03PS2) 10Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) [09:00:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:04] (03PS15) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:03:57] (03PS1) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 
[09:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:04:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:06:13] (03PS2) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [09:07:03] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:09:36] (03CR) 10CI reject: [V:04-1] P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 (owner: 10Slyngshede) [09:14:13] (03PS1) 10Filippo Giunchedi: hieradata: flip debdeploy::client::ensure to present [puppet] - 10https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) [09:14:14] (03PS1) 10Filippo Giunchedi: hieradata: exclude nfs/nfs4 from debdeploy::client in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) [09:15:44] (03CR) 10Elukey: Setup maps2011 as master node for new maps/codfw servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:19:50] FIRING: DiskSpace: Disk space deploy1003:9100:/ 2.797% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:25:27] 06SRE, 
06Infrastructure-Foundations, 10netops: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11166358 (10ayounsi) > We need to allow port number 48 on the Nokias, but not port number 0 as they start from 1 We already (and lazily) do : `min_value=0, max_value=48` which... [09:27:03] (03CR) 10Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:28:10] !log cgoubert@deploy1003:/home$ sudo lvextend -L +20G /dev/vg0/root && sudo resize2fs /dev/vg0/root - T404060 [09:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:14] T404060: makeMailingList.php creates 30GB of data - https://phabricator.wikimedia.org/T404060 [09:29:35] RESOLVED: DiskSpace: Disk space deploy1003:9100:/ 2.773% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:29:52] (03PS3) 10Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) [09:30:15] (03CR) 10Muehlenhoff: Setup maps2011 as master node for new maps/codfw servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:30:18] claime: you may be interested in lvextend --resizefs --size FWIW [09:30:43] godog: Ah, didn't know you could add a switch to do both at the same time, thanks! [09:30:57] I haven't looked at lvextend's man page in something like a decade lol [09:31:28] sure np! 
yeah I think it has been added "recently" [09:32:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [09:32:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:33:08] (03CR) 10Muehlenhoff: [C:03+1] "Do we really still use NFS3? But LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [09:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:36:03] (03CR) 10Elukey: Setup maps2011 as master node for new maps/codfw servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:36:06] (03CR) 10Elukey: [C:03+1] Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:37:08] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [09:37:24] (03PS1) 10Elukey: role::maps: fix tegola_swift_container for codfw [puppet] - 10https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565) [09:37:33] (03CR) 10David Caro: [C:03+1] "I think we tried this once, not sure why we did not get on with it though. Thanks for pushing for this again though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [09:37:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:38:22] (03CR) 10Elukey: [C:03+2] role::maps: fix tegola_swift_container for codfw [puppet] - 10https://gerrit.wikimedia.org/r/1186945 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:38:26] !log upgrading Envoy on Logstash T402584 [09:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:30] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [09:40:21] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [09:40:49] (03CR) 10Elukey: [C:03+1] provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 (owner: 10JHathaway) [09:41:43] (03PS3) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:42:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2011.codfw.wmnet [09:44:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [09:45:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:46:32] (03PS4) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [09:47:42] !log upgrading Envoy on contint T402584 [09:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:45] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [09:48:01] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171 (10elukey) 03NEW [09:48:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2011.codfw.wmnet [09:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2012.codfw.wmnet [09:50:52] (03CR) 10Muehlenhoff: [C:03+2] Setup maps2011 as master node for new maps/codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/1186931 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:51:47] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11166493 (10elukey) [09:53:12] (03PS5) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [09:55:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2012.codfw.wmnet [09:56:41] !log upgrading Envoy on lists T402584 [09:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:45] T402584: Upgrade Envoy to 
v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [09:56:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1248.eqiad.wmnet with reason: Maintenance [09:57:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83095 and previous config saved to /var/cache/conftool/dbconfig/20250910-095700-ladsgroup.json [09:57:05] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [09:57:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2013.codfw.wmnet [09:58:04] (03PS6) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1000) [10:02:33] (03CR) 10Vgutierrez: sre.loadbalancer: modify admin.py to accept 'reboot' action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [10:03:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2013.codfw.wmnet [10:03:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2014.codfw.wmnet [10:05:29] (03PS7) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [10:06:30] !log upgrading Envoy on lists T402584 [10:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:34] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [10:06:37] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11166565 (10MoritzMuehlenhoff) All baremetal installations of Envoy have been upgraded [10:06:42] 
!log upgrading Envoy on Phabricator T402584 [10:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1186946 (https://phabricator.wikimedia.org/T404178) [10:08:45] (03PS8) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [10:09:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2014.codfw.wmnet [10:09:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1011.eqiad.wmnet [10:10:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404178 [10:11:01] T404178: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T404178 [10:11:27] (03PS1) 10Ladsgroup: db1181: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186947 (https://phabricator.wikimedia.org/T399955) [10:12:31] !log imported imposm3 0.14.1-2 T381565 [10:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [10:13:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11166591 (10MoritzMuehlenhoff) I've updated imposm once more to cherrypick two additional fixes: https://github.com/omniscale/imposm3/commit/dc3ebd0746ba7a73b2099c2cda343fc2c6d8d206... 
[10:13:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Glow up db1181 (T399955)', diff saved to https://phabricator.wikimedia.org/P83096 and previous config saved to /var/cache/conftool/dbconfig/20250910-101345-ladsgroup.json [10:13:50] T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955 [10:14:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: Glow up [10:15:28] (03CR) 10Ladsgroup: [C:03+2] db1181: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186947 (https://phabricator.wikimedia.org/T399955) (owner: 10Ladsgroup) [10:15:56] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11166605 (10elukey) As a first very bare/minimum example I created: ` version: "prometheus/v1" service: "citoid" labels: owner: "sre" slos: - name: "requests-availability" objective: 99.5 descri... [10:16:32] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1011.eqiad.wmnet [10:17:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83097 and previous config saved to /var/cache/conftool/dbconfig/20250910-101758-ladsgroup.json [10:18:02] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [10:20:49] (03PS16) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [10:22:35] (03CR) 
10Federico Ceratto: [C:03+2] mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1186946 (https://phabricator.wikimedia.org/T404178) (owner: 10Gerrit maintenance bot) [10:24:09] !log Starting s1 codfw failover from db2212 to db2203 - T404178 [10:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:13] T404178: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T404178 [10:25:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T404178', diff saved to https://phabricator.wikimedia.org/P83098 and previous config saved to /var/cache/conftool/dbconfig/20250910-102507-fceratto.json [10:27:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1012.eqiad.wmnet [10:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P83099 and previous config saved to /var/cache/conftool/dbconfig/20250910-103305-ladsgroup.json [10:33:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1012.eqiad.wmnet [10:34:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1181 T399955', diff saved to https://phabricator.wikimedia.org/P83100 and previous config saved to 
/var/cache/conftool/dbconfig/20250910-103436-ladsgroup.json [10:34:42] T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955 [10:34:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.661 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:35:32] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1186948 (https://phabricator.wikimedia.org/T404180) [10:35:37] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1186949 (https://phabricator.wikimedia.org/T404180) [10:36:18] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [10:37:30] jouncebot: nowandnext [10:37:30] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1000) [10:37:30] In 0 hour(s) and 22 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100) [10:40:49] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T404180 [10:40:53] T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180 [10:41:16] (03PS17) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [10:41:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 
'Set db1181 with weight 0 T404180', diff saved to https://phabricator.wikimedia.org/P83101 and previous config saved to /var/cache/conftool/dbconfig/20250910-104223-ladsgroup.json [10:48:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P83103 and previous config saved to /var/cache/conftool/dbconfig/20250910-104813-ladsgroup.json [10:48:53] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1186948 (https://phabricator.wikimedia.org/T404180) (owner: 10Gerrit maintenance bot) [10:49:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1013.eqiad.wmnet [10:50:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet [10:50:28] !log Starting s7 eqiad failover from db1236 to db1181 - T404180 [10:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:32] T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180 [10:50:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1173 - Upgrading db1173.eqiad.wmnet [10:50:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T404180', diff saved to https://phabricator.wikimedia.org/P83104 and previous config saved to /var/cache/conftool/dbconfig/20250910-105042-ladsgroup.json [10:51:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:51:31] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1173 - Upgrading db1173.eqiad.wmnet [10:52:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1181 to s7 primary and set section 
read-write T404180', diff saved to https://phabricator.wikimedia.org/P83106 and previous config saved to /var/cache/conftool/dbconfig/20250910-105205-ladsgroup.json [10:54:03] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1186949 (https://phabricator.wikimedia.org/T404180) (owner: 10Gerrit maintenance bot) [10:54:19] !log ladsgroup@dns1004 START - running authdns-update [10:55:26] !log ladsgroup@dns1004 END - running authdns-update [10:55:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1013.eqiad.wmnet [10:56:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1236 T404180', diff saved to https://phabricator.wikimedia.org/P83107 and previous config saved to /var/cache/conftool/dbconfig/20250910-105650-ladsgroup.json [10:56:55] T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180 [10:57:50] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet [10:58:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1014.eqiad.wmnet [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100). 
nyaa~ [11:00:57] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11166780 (10MoritzMuehlenhoff) [11:03:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402763)', diff saved to https://phabricator.wikimedia.org/P83109 and previous config saved to /var/cache/conftool/dbconfig/20250910-110320-ladsgroup.json [11:03:25] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [11:03:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1249.eqiad.wmnet with reason: Maintenance [11:03:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83110 and previous config saved to /var/cache/conftool/dbconfig/20250910-110343-ladsgroup.json [11:04:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet [11:04:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1173 - Upgrading db1173.eqiad.wmnet [11:04:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1173 - Upgrading db1173.eqiad.wmnet [11:04:54] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186953 [11:05:18] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale-full only: 3 (gerrit2003, ...), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:05:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1014.eqiad.wmnet [11:05:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: Clean up the mess [11:07:39] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1236.eqiad.wmnet [11:07:47] !log kick off full OSM import for the new maps cluster in codfw 
T381565 [11:07:48] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1236 - Upgrading db1236.eqiad.wmnet [11:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [11:07:55] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1236 - Upgrading db1236.eqiad.wmnet [11:08:26] (03CR) 10Ladsgroup: [C:03+1] upgrade.py: Restart Prometheus exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/1186532 (owner: 10Federico Ceratto) [11:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:09:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2212.codfw.wmnet with reason: Maintenance [11:09:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83111 and previous config saved to /var/cache/conftool/dbconfig/20250910-110937-fceratto.json [11:09:42] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [11:10:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1173 gradually with 4 steps - Upgrade of db1173.eqiad.wmnet completed [11:13:34] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1236.eqiad.wmnet [11:15:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83113 and previous config saved to /var/cache/conftool/dbconfig/20250910-111503-fceratto.json [11:15:08] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [11:19:37] (03PS9) 
10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [11:20:49] (03PS3) 10Tchanders: Enable temporary accounts on all medium-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: 10STran) [11:22:38] (03PS1) 10Ladsgroup: db1236: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186957 (https://phabricator.wikimedia.org/T399955) [11:22:40] (03CR) 10Tchanders: [C:03+1] Enable temporary accounts on all medium-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: 10STran) [11:23:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83114 and previous config saved to /var/cache/conftool/dbconfig/20250910-112354-ladsgroup.json [11:23:59] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [11:24:30] (03CR) 10Ladsgroup: [C:03+2] db1236: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186957 (https://phabricator.wikimedia.org/T399955) (owner: 10Ladsgroup) [11:25:15] !log installing Linux 6.1.148 on Bookworm hosts [11:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:31] PROBLEM - MariaDB Replica SQL: s7 #page on db1236 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:27:31] PROBLEM - MariaDB read only s7 on db1236 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:27:32] PROBLEM - MariaDB Replica IO: s7 #page on db1236 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[11:27:49] ^ federico3 expected? [11:27:53] PROBLEM - mysqld processes #page on db1236 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:27:54] no, looking [11:28:09] Amir1? [11:28:13] ignore the alert [11:28:22] ok [11:28:24] expired/removed downtime [11:28:33] (03PS1) 10Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [11:28:33] !incidents [11:28:34] 6723 (UNACKED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [11:28:34] 6724 (UNACKED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [11:28:34] 6725 (UNACKED) db1236 (paged)/mysqld processes (paged) [11:28:40] !ack 6723 [11:28:41] 6723 (ACKED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [11:28:44] !ack 6724 [11:28:45] 6724 (ACKED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [11:28:51] !ack 6725 [11:28:51] 6725 (ACKED) db1236 (paged)/mysqld processes (paged) [11:29:06] Related to https://phabricator.wikimedia.org/T399955? 
[11:29:35] yup but I did downtime the whole thing for two hours [11:29:39] (03PS10) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [11:29:53] RECOVERY - mysqld processes #page on db1236 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:29:59] (03CR) 10CI reject: [V:04-1] [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [11:30:31] RECOVERY - MariaDB Replica SQL: s7 #page on db1236 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:33] RECOVERY - MariaDB Replica IO: s7 #page on db1236 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:33] RECOVERY - MariaDB read only s7 on db1236 is OK: Version 10.11.13-MariaDB-log, Uptime 76s, read_only: True, event_scheduler: True, 178.48 QPS, connection latency: 0.026488s, query latency: 0.000869s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:30:38] we should really just write something so it wouldn't page for depooled hosts [11:30:41] (03PS11) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [11:31:15] The whole point is that it alerted before making a mistake [11:31:33] otherwise it is too late- it requires human intervention before, not after [11:31:35] Amir1: Can you access the pooled state from the host itself? 
[11:32:08] claime: I don't think so but it should be in https://noc.wikimedia.org/dbconfig/eqiad.json or equivalent in codfw [11:32:18] so matter of http request [11:32:18] although it could be downgraded to a non-p*ging one when depooled [11:32:27] (03PS12) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [11:32:30] yeah [11:32:31] Amir1: Hmm, my thought was to dump the pooled state in the node-exporter config [11:32:46] so you can check it from inside the am alert [11:33:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [11:33:09] s/config/file/ [11:33:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2191 - Upgrading db2191.codfw.wmnet [11:33:44] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2191 - Upgrading db2191.codfw.wmnet [11:37:11] > 10:40 ladsgroup@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T404180 [11:37:11] T404180: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T404180 [11:37:25] The downtime isn't expired, something is removing the downtime [11:39:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P83117 and previous config saved to /var/cache/conftool/dbconfig/20250910-113902-ladsgroup.json [11:40:29] (03CR) 10Cathal Mooney: "LGTM overall nice work! We should probably try to apply it on a device see if there is any issue?" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [11:41:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:43:01] jouncebot: nowandnext [11:43:02] For the next 0 hour(s) and 16 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100) [11:43:02] In 1 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1300) [11:43:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [11:44:34] jouncebot: nowandnex [11:44:37] jouncebot: update [11:44:48] jouncebot: refresh [11:44:49] I refreshed my knowledge about deployments. 
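The pooled-state check discussed at 11:31 to 11:33 (fetch the dbconfig JSON over HTTP, then dump the result into a file that node-exporter can expose so the alert can consult it) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not something anyone in the channel implemented: the URL is the one quoted in the discussion, but the JSON layout is not confirmed by the log, so the lookup simply searches for the hostname as a key anywhere in the document and assumes depooled hosts are absent from it. The metric name `mysql_host_pooled` and all function names are hypothetical.

```python
import json
from urllib.request import urlopen

# URL quoted in the channel; the JSON layout is NOT confirmed by the log.
DBCONFIG_URL = "https://noc.wikimedia.org/dbconfig/eqiad.json"


def fetch_dbconfig(url=DBCONFIG_URL, timeout=5):
    """Download and parse the dbconfig JSON (the 'matter of http request')."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def host_is_listed(config, host):
    """Return True if `host` appears as a dict key anywhere in `config`.

    Assumption: pooled hosts show up as keys and depooled hosts are absent.
    """
    if isinstance(config, dict):
        return any(key == host or host_is_listed(value, host)
                   for key, value in config.items())
    if isinstance(config, list):
        return any(host_is_listed(item, host) for item in config)
    return False


def render_pooled_metric(host, pooled):
    """Format one line for a node-exporter textfile (*.prom) collector file.

    `mysql_host_pooled` is a hypothetical metric name.
    """
    return f'mysql_host_pooled{{host="{host}"}} {1 if pooled else 0}\n'
```

A cron job or timer could call `render_pooled_metric(host, host_is_listed(fetch_dbconfig(), host))` and write the result into the textfile collector directory, which would let an alert rule downgrade or skip paging when the metric is 0, as suggested at 11:32:18.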
[11:44:54] jouncebot: nowandnext [11:44:54] For the next 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1100) [11:44:54] In 0 hour(s) and 15 minute(s): Deployment of CheckUser Suggested Investigations signals (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200) [11:44:56] (03PS1) 10KartikMistry: Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730) [11:44:56] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2191 gradually with 4 steps - Upgrade of db2191.codfw.wmnet completed [11:45:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [11:45:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2212 (T402925)', diff saved to https://phabricator.wikimedia.org/P83120 and previous config saved to /var/cache/conftool/dbconfig/20250910-114549-ladsgroup.json [11:45:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:47:32] (03PS1) 10Btullis: Install the opensearch-operator-crd chart to the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) [11:48:42] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1236* gradually with 4 steps - Work done [11:54:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P83122 and previous config saved to /var/cache/conftool/dbconfig/20250910-115409-ladsgroup.json [11:56:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1173 gradually with 4 steps - Upgrade of db1173.eqiad.wmnet completed [11:56:12] !log fceratto@cumin1002 END (PASS) - Cookbook 
sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet [11:57:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11166996 (10MoritzMuehlenhoff) [11:59:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:04] Dreamy_Jazz: May I have your attention please! Deployment of CheckUser Suggested Investigations signals. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200) [12:00:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T402763)', diff saved to https://phabricator.wikimedia.org/P83125 and previous config saved to /var/cache/conftool/dbconfig/20250910-120024-fceratto.json [12:00:29] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:00] (03CR) 10Federico Ceratto: [C:03+2] upgrade.py: Restart Prometheus exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/1186532 (owner: 10Federico Ceratto) [12:04:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.969 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:05:43] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Augment the dse-k8s cluster namespaces. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) (owner: 10Stevemunene) [12:06:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.541 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:15] (03PS1) 10Muehlenhoff: Setup maps1011 as master node for new maps/eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) [12:08:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:08:58] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry) [12:09:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402763)', diff saved to https://phabricator.wikimedia.org/P83127 and previous config saved to /var/cache/conftool/dbconfig/20250910-120917-ladsgroup.json [12:09:22] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:09:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1252.eqiad.wmnet with reason: Maintenance [12:09:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83128 and previous config saved to /var/cache/conftool/dbconfig/20250910-120940-ladsgroup.json [12:12:49] (03PS2) 10Ayounsi: [WIP] Analytics and loopback 
ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [12:13:22] (03Merged) 10jenkins-bot: dse-k8s: Augment the dse-k8s cluster namespaces. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186487 (https://phabricator.wikimedia.org/T404068) (owner: 10Stevemunene) [12:13:57] (03PS3) 10Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [12:14:02] (03CR) 10CI reject: [V:04-1] [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [12:16:31] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-09-10-080042-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186961 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry) [12:17:11] (03PS2) 10Sbisson: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) [12:19:19] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [12:20:25] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [12:21:09] (03CR) 10Ayounsi: [WIP] Analytics and loopback ACLs (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [12:21:38] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:23:03] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404104#11167076 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:25:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [12:25:47] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [12:26:04] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:29:44] (03CR) 10Vgutierrez: [C:03+1] "looking good 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1186935 (owner: 10Slyngshede) [12:30:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83131 and previous config saved to /var/cache/conftool/dbconfig/20250910-123011-ladsgroup.json [12:30:16] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:30:24] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2191 gradually with 4 steps - Upgrade of db2191.codfw.wmnet completed [12:30:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet [12:31:18] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:33:51] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:34:10] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1236* gradually with 4 steps - Work done [12:34:38] I've finished with my window and so feel free to deploy etc. now [12:36:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:37:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: Maintenance [12:37:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83134 and previous config saved to /var/cache/conftool/dbconfig/20250910-123719-fceratto.json [12:37:24] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:37:45] (03CR) 10Cathal Mooney: [WIP] Analytics and loopback ACLs (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [12:38:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T402763)', diff saved to https://phabricator.wikimedia.org/P83135 and previous config saved to /var/cache/conftool/dbconfig/20250910-123831-fceratto.json [12:38:45] (03PS1) 10Btullis: Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - 10https://gerrit.wikimedia.org/r/1186974 [12:39:40] Dreamy_Jazz: Thanks, I'll go ahead with the temporary accounts deployment, scheduled for the afternoon backport window [12:39:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:42:15] (03PS2) 
10Btullis: Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - 10https://gerrit.wikimedia.org/r/1186974 (https://phabricator.wikimedia.org/T399779) [12:42:24] (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Fix the partman recipe for dse-k8s-worker1014" [puppet] - 10https://gerrit.wikimedia.org/r/1186974 (https://phabricator.wikimedia.org/T399779) (owner: 10Btullis) [12:44:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: 10STran) [12:44:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) (owner: 10STran) [12:44:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:44:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:45:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P83136 and previous config saved to /var/cache/conftool/dbconfig/20250910-124518-ladsgroup.json [12:45:25] (03PS1) 10Ladsgroup: db2185: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371) [12:45:39] (03Merged) 10jenkins-bot: Enable temporary accounts on all medium-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: 10STran) [12:45:43] (03Merged) 10jenkins-bot: Enable temporary accounts on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) (owner: 10STran) [12:46:07] !log 
jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:46:14] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167133 (10BTullis) We decided that it's better to switch this back to legacy boot mode, since all of the other dse-k8s-workers are still using that. If we want to switch... [12:46:28] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]] [12:46:33] T403399: Deploy Temporary accounts to all medium-sized projects - https://phabricator.wikimedia.org/T403399 [12:46:34] T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181 [12:46:43] (03PS5) 10Jforrester: Increase max recursion depth in the orchestrator's composition language. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro) [12:46:48] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2185.codfw.wmnet with reason: Glow up (T394371) [12:46:54] T394371: Migrate to MariaDB 10.11 - https://phabricator.wikimedia.org/T394371 [12:46:54] (03PS6) 10Jforrester: wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro) [12:47:06] jouncebot now [12:47:06] For the next 0 hour(s) and 12 minute(s): Deployment of CheckUser Suggested Investigations signals (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1200) [12:47:49] (03PS2) 10Ladsgroup: db2185: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371) [12:47:54] (03CR) 10Ladsgroup: [V:03+2 C:03+2] db2185: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186975 (https://phabricator.wikimedia.org/T394371) (owner: 10Ladsgroup) [12:48:03] !log installing unbound security updates on bullseyre [12:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:05] !log installing unbound security updates on bullseye [12:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:32] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941) [12:50:43] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061) [12:50:51] (03PS1) 10Jforrester: wikifunctions: Pre-emptively 
disable Wikidata reference fetching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425) [12:50:52] !log tchanders@deploy1003 tchanders, stran: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:50:54] (03PS1) 10Jforrester: wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956) [12:50:58] (03PS1) 10Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956) [12:51:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [12:51:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83137 and previous config saved to /var/cache/conftool/dbconfig/20250910-125108-fceratto.json [12:51:12] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:52:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2203', diff saved to https://phabricator.wikimedia.org/P83138 and previous config saved to /var/cache/conftool/dbconfig/20250910-125216-fceratto.json [12:52:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83139 and previous config saved to /var/cache/conftool/dbconfig/20250910-125224-fceratto.json [12:53:32] (03PS1) 10Ladsgroup: db1215: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371) [12:55:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1215.eqiad.wmnet with reason: Glow up (T399540 T394371) [12:55:08] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [12:55:08] T394371: Migrate to MariaDB 10.11 - https://phabricator.wikimedia.org/T394371 [12:57:13] !log tchanders@deploy1003 tchanders, stran: Continuing with sync [12:58:02] Tested that the account is attached to newly enabled wikis, and not attached to disabled wikis, so going ahead [12:58:17] (03PS2) 10Ladsgroup: db1215: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371) [12:58:30] (03CR) 10Ladsgroup: [V:03+2 C:03+2] db1215: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186983 (https://phabricator.wikimedia.org/T394371) (owner: 10Ladsgroup) [12:59:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:00] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11167203 (10ayounsi) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1300). [13:00:05] stephanebisson and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:24] o/ [13:00:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250910-130026-ladsgroup.json [13:00:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:41] o/ [13:00:52] The previous window finished early so I got started on my deployment - it's just syncing out now [13:01:27] Great! I'll go when you're done [13:01:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:35] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185845|Enable temporary accounts on all medium-sized projects (T403399)]], [[gerrit:1185838|Enable temporary accounts on metawiki (T402181)]] (duration: 16m 06s) [13:02:41] T403399: Deploy Temporary accounts to all medium-sized projects - https://phabricator.wikimedia.org/T403399 [13:02:41] T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181 [13:04:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.923 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:05:10] !log Updated Recommendation API to 2025-09-10-080042-production (T403730, T403976, T400562) [13:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:17] T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730 [13:05:18] T403976: Section suggestions: Appendix sections should not be considered as valid suggestions for sections - 
https://phabricator.wikimedia.org/T403976 [13:05:18] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:05:18] T400562: Create a unified Logstash dashboard displaying errors from cx, cxserver, RecommentationAPI, MinT - https://phabricator.wikimedia.org/T400562 [13:05:20] Forgot to log this earlier ^^ [13:05:36] stephanebisson: I'm around now if you need any help. [13:05:53] (03PS1) 10Ayounsi: Handle nokia interface name style [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1186985 [13:06:26] kart_ do you want to drive the deployment? [13:07:05] (03PS2) 10Ayounsi: Handle nokia interface name style [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) [13:07:10] stephanebisson: sure. [13:07:22] (03CR) 10Ladsgroup: "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) (owner: 10STran) [13:07:35] stephanebisson: let's start with first patch? [13:08:26] kart_ lets start with the "Desktop publish_success:..." [13:08:41] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167262 (10Jclark-ctr) @elukey Since we’re switching back to Legacy boot mode, I attempted to provision again, but it failed. When I went to manually change the BIOS, the... 
[13:08:48] (03PS3) 10Ayounsi: Handle nokia interface name style [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) [13:08:58] stephanebisson: OK [13:09:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: 10Sbisson) [13:15:25] o/ [13:15:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402763)', diff saved to https://phabricator.wikimedia.org/P83141 and previous config saved to /var/cache/conftool/dbconfig/20250910-131538-ladsgroup.json [13:15:43] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [13:15:47] (03PS4) 10Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [13:15:54] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:17:03] (03PS1) 10Muehlenhoff: Enable the regular imports of the OSM updates and water lines on maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565) [13:17:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2212', diff saved to https://phabricator.wikimedia.org/P83142 and previous config saved to /var/cache/conftool/dbconfig/20250910-131728-fceratto.json [13:17:50] !log installing apache2 security updates [13:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:18:20] PROBLEM - BFD status on ssw1-f1-codfw.mgmt is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:19:10] FIRING: BFDdown: BFD session down between ssw1-f1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:20:24] (03Merged) 10jenkins-bot: Desktop publish_success: add revid and pageid [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186651 (https://phabricator.wikimedia.org/T402975) (owner: 10Sbisson) [13:20:51] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]] [13:20:55] T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975 [13:21:43] (03PS11) 10Ayounsi: Use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [13:21:51] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1186988 (https://phabricator.wikimedia.org/T404192) [13:22:10] (03CR) 10Ayounsi: Use Homer to configure the network (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [13:22:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P83143 and previous config saved to /var/cache/conftool/dbconfig/20250910-132239-fceratto.json [13:23:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T404192 [13:23:07] T404192: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T404192 [13:23:20] PROBLEM - BFD status on ssw1-e1-codfw.mgmt is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:24:10] FIRING: [2x] BFDdown: BFD session down between ssw1-e1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:24:48] !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1233-1236].eqiad.wmnet [13:26:44] (03PS13) 10Slyngshede: P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 [13:26:55] !log kartik@deploy1003 kartik, sbisson: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:26:59] T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975 [13:27:22] stephanebisson: you can test the patch. 
[13:27:28] on it [13:29:06] (03CR) 10Elukey: [C:03+1] Setup maps1011 as master node for new maps/eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:29:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1233-1236].eqiad.wmnet [13:30:02] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1186988 (https://phabricator.wikimedia.org/T404192) (owner: 10Gerrit maintenance bot) [13:30:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:28] Kart_ All good, go ahead [13:30:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1235.eqiad.wmnet [13:31:05] stephanebisson: sure [13:31:11] !log kartik@deploy1003 kartik, sbisson: Continuing with sync [13:31:33] (03CR) 10Slyngshede: P:cache::haproxy unittests for Lua module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1186935 (owner: 10Slyngshede) [13:31:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:31:39] !log Starting s8 codfw failover from db2161 to db2165 - T404192 [13:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] T404192: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T404192 [13:31:46] (03CR) 10Bking: [C:03+2] "Looks great, thank you for helping out here!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [13:32:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T404192', diff saved to https://phabricator.wikimedia.org/P83145 and previous config saved to /var/cache/conftool/dbconfig/20250910-133231-fceratto.json [13:32:36] stephanebisson: I'll also +2 in advance for CX build patch [13:33:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [13:33:35] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy unittests for Lua module [puppet] - 10https://gerrit.wikimedia.org/r/1186935 (owner: 10Slyngshede) [13:33:44] stephanebisson: Is that patch updated? I saw some updates in master. [13:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:40] Kart_ I updated the branch too but I did it manually because the version in master was also rebased and that included unwanted changes [13:34:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:22] stephanebisson: OK. Let's +2. 
[13:35:45] (03CR) 10KartikMistry: [C:03+2] CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: 10Sbisson) [13:36:32] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186651|Desktop publish_success: add revid and pageid (T402975)]] (duration: 15m 41s) [13:36:37] T402975: CX event: desktop `publish_success` events don't have published_revision_id and published_page_id - https://phabricator.wikimedia.org/T402975 [13:36:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: 10Sbisson) [13:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:37:22] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:37:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83146 and previous config saved to /var/cache/conftool/dbconfig/20250910-133729-ladsgroup.json [13:37:34] T402763: Drop rc_new from recentchanges table in wmf 
production - https://phabricator.wikimedia.org/T402763 [13:37:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T401906)', diff saved to https://phabricator.wikimedia.org/P83147 and previous config saved to /var/cache/conftool/dbconfig/20250910-133746-fceratto.json [13:37:51] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:37:53] !incidents [13:37:53] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [13:37:53] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [13:37:53] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [13:38:53] (03Merged) 10jenkins-bot: Install the opensearch-operator-crd chart to the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [13:40:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: Maintenance [13:40:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T402763)', diff saved to https://phabricator.wikimedia.org/P83148 and previous config saved to /var/cache/conftool/dbconfig/20250910-134046-fceratto.json [13:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.281s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:42:19] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [13:42:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on 
ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:30] (03CR) 10Btullis: "Yes, I think that the 'templates' path element is required. I didn't spot it at first, so when I tried a `rake run_locally` with only the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186964 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [13:45:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:53] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:46:20] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:47:29] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250909 [extensions/ContentTranslation] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186650 (https://phabricator.wikimedia.org/T374886) (owner: 10Sbisson) [13:47:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T402763)', diff saved to https://phabricator.wikimedia.org/P83149 and previous config saved to /var/cache/conftool/dbconfig/20250910-134734-fceratto.json [13:47:39] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [13:47:55] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:56] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 
T399125 T399133 T403730 T404045 T404093)]] [13:48:13] (03Abandoned) 10Brouberol: runner: redact mysql password from the command string when reporting an error [dumps] - 10https://gerrit.wikimedia.org/r/1186930 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [13:48:16] T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886 [13:48:16] T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998 [13:48:16] T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122 [13:48:17] T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125 [13:48:17] T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133 [13:48:17] T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730 [13:48:18] T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045 [13:48:18] T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093 [13:48:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1235.eqiad.wmnet [13:48:31] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[13:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:44] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11167514 (10elukey) ==== Error Budget calculations ==== The grafana dashboard's JSON shows these two expressions: ` "description": "This graph shows the month error budget burn down chart (starts the 1st... [13:49:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.543 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:50:03] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:51:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2161 weight based on db2165', diff saved to https://phabricator.wikimedia.org/P83150 and previous config saved to /var/cache/conftool/dbconfig/20250910-135119-fceratto.json [13:53:51] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:54:25] !log kartik@deploy1003 sbisson, kartik: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 T399125 T399133 T403730 T404045 T404093)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:54:40] T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886 [13:54:40] T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998 [13:54:40] T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122 [13:54:41] T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125 [13:54:41] T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133 [13:54:42] T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730 [13:54:42] T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045 [13:54:43] T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093 [13:55:26] stephanebisson: ready for testing! [13:55:31] on it [13:55:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Reset weights on db2212 and db2203', diff saved to https://phabricator.wikimedia.org/P83151 and previous config saved to /var/cache/conftool/dbconfig/20250910-135553-fceratto.json [13:56:40] (03CR) 10Scott French: "Many thanks for the review, Valentin." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [13:56:46] (03CR) 10Scott French: "+cc @cgoubert@wikimedia.org FYI, since this will conflict structurally, but not functionally, with Ibe367b528408886f34748e1b935b192a6d8c33" [puppet] - 10https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [13:57:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set weight on db2204', diff saved to https://phabricator.wikimedia.org/P83152 and previous config saved to /var/cache/conftool/dbconfig/20250910-135720-fceratto.json [13:57:29] (03CR) 10Volans: "Great work, thanks a lot to have kept iterating on it! I've left some questions, couple of possible small issues and few nits inline. I th" [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:57:58] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11167580 (10herron) p:05Triage→03Medium [13:58:23] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11167591 (10herron) 05Open→03Resolved Tonecheck metrics have been backfilled with a clean history [13:58:40] (03PS1) 10Clément Goubert: rest-gateway: Temp bump to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1400) [14:00:22] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941) (owner: 10Jforrester) [14:01:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83153 and previous config saved to /var/cache/conftool/dbconfig/20250910-140147-ladsgroup.json [14:01:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance [14:01:53] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [14:01:59] James_F: We're still deploying. I'll ping once done. [14:02:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83154 and previous config saved to /var/cache/conftool/dbconfig/20250910-140159-fceratto.json [14:02:04] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:02:07] kart_: It's fine, services not MW land. [14:02:16] (03CR) 10Scott French: [C:03+1] "Thank you! While ideally this probably wouldn't be needed, it would be nice not to have to think about it while making the DNS changes." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [14:02:19] Thanks [14:02:19] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-09-03-123051 to 2025-09-09-171717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186978 (https://phabricator.wikimedia.org/T380941) (owner: 10Jforrester) [14:02:43] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Temp bump to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [14:03:49] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:03:51] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:04:07] (03PS1) 10Btullis: Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) [14:04:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83155 and previous config saved to /var/cache/conftool/dbconfig/20250910-140410-fceratto.json [14:04:18] (03Merged) 10jenkins-bot: rest-gateway: Temp bump to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186996 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [14:04:19] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:24] (03CR) 10Vgutierrez: [C:03+1] P:trafficserver::backend: add mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:04:30] !log cgoubert@deploy1003 helmfile [codfw] 
START helmfile.d/services/rest-gateway: apply [14:04:31] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:04:35] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:04:42] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:04:51] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:04:56] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:05:02] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:05:35] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:05:39] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:06:21] kart_ you can go ahed [14:06:24] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:28] cool [14:06:28] kart_ you can go ahead [14:06:32] !log kartik@deploy1003 sbisson, kartik: Continuing with sync [14:06:42] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061) (owner: 10Jforrester) [14:07:08] (03CR) 10Vgutierrez: [C:03+1] hieradata: add mw-next-routing to ATS tslua plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:08:30] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-04-003606 to 2025-09-08-191243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186979 (https://phabricator.wikimedia.org/T381061) (owner: 10Jforrester) [14:08:34] (03CR) 10Brouberol: [C:03+1] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 
(https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:08:36] (03CR) 10Ssingh: wmnet: Introduce rest-gateway-ro (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:09:10] (03CR) 10Stevemunene: [C:03+1] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:09:53] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:20] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:39] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:14] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:11:21] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:12:05] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186650|CX3 Build 1.0.0+20250909 (T374886 T394998 T399122 T399125 T399133 T403730 T404045 T404093)]] (duration: 24m 08s) [14:12:19] T374886: SX: Use source/target languages from URL params everywhere - https://phabricator.wikimedia.org/T374886 [14:12:20] T394998: Translation time estimations are very underestimated - https://phabricator.wikimedia.org/T394998 [14:12:20] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:12:20] T399122: Show aggregate section information with difficulty indicators in “Expand with new sections” list - https://phabricator.wikimedia.org/T399122 [14:12:20] T399125: Instrumentation: log recommendation difficulty level - https://phabricator.wikimedia.org/T399125 [14:12:21] T399133: Show easy recommendations to beginners - https://phabricator.wikimedia.org/T399133 [14:12:22] T403730: Treat article translation on mobile as (lead) section translation 
- https://phabricator.wikimedia.org/T403730 [14:12:22] T404045: CX Unified Dashboard: Favorite suggestions display current languages instead of the suggestion languages - https://phabricator.wikimedia.org/T404045 [14:12:23] T404093: Decide article and section difficulty level size thresholds - https://phabricator.wikimedia.org/T404093 [14:12:50] (03PS1) 10Andrew Bogott: eqiad1 cloudcontrols => ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187000 [14:13:37] (03CR) 10Jforrester: [C:03+2] wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro) [14:13:37] stephanebisson86: all done [14:13:55] kart_ thanks! [14:14:46] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11167669 (10elukey) ==== Alerting ==== Adding some alerting rules adds the following: ` - name: sloth-slo-alerts-citoid-requests-availability rules: - alert: CitoidHighErrorRate expr: | (... 
[14:14:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:21] (03CR) 10Andrew Bogott: [C:03+2] eqiad1 cloudcontrols => ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187000 (owner: 10Andrew Bogott) [14:16:06] (03Merged) 10jenkins-bot: wikifunctions: Increase max recursion depth in the orchestrator's composition language [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403594) (owner: 10Cory Massaro) [14:16:32] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P83156 and previous config saved to /var/cache/conftool/dbconfig/20250910-141655-ladsgroup.json [14:17:14] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:17:38] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:11] (03CR) 10Btullis: [C:03+2] Add four new (renamed) an-worker nodes to the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1186998 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:19:16] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:19:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P83157 and previous config saved to /var/cache/conftool/dbconfig/20250910-141917-fceratto.json [14:19:41] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:19:55] !log 
jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:20:10] kart_: Are you complete at your end? [14:20:30] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:20:39] (03CR) 10Btullis: [C:03+1] druid: Bring druid1012.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182698 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [14:20:44] (03CR) 10Btullis: [C:03+1] druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [14:20:56] (03CR) 10Btullis: [C:03+1] druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [14:20:58] (03CR) 10Jforrester: [C:03+2] wikifunctions: Pre-emptively disable Wikidata reference fetching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425) (owner: 10Jforrester) [14:21:05] (03CR) 10Jforrester: [C:03+2] wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:21:17] (03CR) 10Btullis: [C:03+1] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [14:22:48] (03Merged) 10jenkins-bot: wikifunctions: Pre-emptively disable Wikidata reference fetching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186980 (https://phabricator.wikimedia.org/T399425) (owner: 10Jforrester) [14:22:55] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:59] (03PS4) 10JHathaway: provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 [14:23:03] (03Merged) 10jenkins-bot: wikifunctions: Enable (short-lived) caching of Wikidata items, as a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186981 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:24:31] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:24:48] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:25:11] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:25:32] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:25:37] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:26:00] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:26:15] James_F: sorry. Yes. Done. [14:26:22] kart_: Awesome, no worries. 
[14:26:39] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186953 (owner: 10PipelineBot) [14:27:15] (03PS1) 10Scott French: shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) [14:27:53] (03PS1) 10Andrew Bogott: eqiad1: fix hiera key for moving cloudcontrols to ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187004 [14:27:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187004 (owner: 10Andrew Bogott) [14:28:28] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186953 (owner: 10PipelineBot) [14:28:55] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:20] RECOVERY - BFD status on ssw1-f1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:20] RECOVERY - BFD status on ssw1-e1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:04] (03CR) 10Andrew Bogott: [C:03+2] eqiad1: fix hiera key for moving cloudcontrols to ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1187004 (owner: 10Andrew Bogott) [14:30:04] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1430) [14:30:09] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:30:30] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:31:18] (03PS1) 10Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956) [14:31:26] (03CR) 10Jforrester: [C:03+2] wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:31:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.666 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:32:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P83158 and previous config saved to /var/cache/conftool/dbconfig/20250910-143202-ladsgroup.json [14:32:50] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:33:09] (03Merged) 10jenkins-bot: wikifunctions: Expand caching of Wikidata items TTL to one minute [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187006 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:33:40] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:34:02] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:34:10] RESOLVED: [2x] BFDdown: BFD session down between ssw1-e1-codfw and 2620:0:860:13f::23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:34:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P83159 and previous config saved to /var/cache/conftool/dbconfig/20250910-143424-fceratto.json [14:34:29] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:34:47] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:34:52] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:35:13] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:35:26] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:36:01] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:36:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186517 (owner: 10Jforrester) [14:36:23] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:37:25] (03CR) 10Scott French: [C:03+2] P:trafficserver::backend: add mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:38:09] 06SRE, 10DNS, 06FR-donorrelations, 06Traffic: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278#11167815 (10ssingh) Hi @EBrill-WMF: Happy to set up a time to talk about this; please let me know and I can set that up. Thanks. 
[14:39:21] (03PS5) 10Ayounsi: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [14:39:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.484 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:40:14] (03CR) 10Hnowlan: [C:03+1] shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [14:40:40] (03CR) 10Muehlenhoff: [C:03+2] Setup maps1011 as master node for new maps/eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1186969 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:41:08] (03CR) 10JHathaway: [C:03+2] provision: on reboot wait for bios attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1186619 (owner: 10JHathaway) [14:41:17] (03Merged) 10jenkins-bot: Improve performance of preferred labels subquery [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1186517 (owner: 10Jforrester) [14:41:46] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]] [14:43:33] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: flip debdeploy::client::ensure to present [puppet] - 10https://gerrit.wikimedia.org/r/1186937 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [14:43:42] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: exclude nfs/nfs4 from debdeploy::client in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1186938 (https://phabricator.wikimedia.org/T336845) (owner: 10Filippo Giunchedi) [14:45:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:46:25] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:46:29] (03PS2) 10Jforrester: wikifunctions: Expand caching of Wikidata items TTL to one day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956) [14:47:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402763)', diff saved to https://phabricator.wikimedia.org/P83160 and previous config saved to /var/cache/conftool/dbconfig/20250910-144710-ladsgroup.json [14:47:14] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [14:47:25] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:47:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83161 and previous config saved to /var/cache/conftool/dbconfig/20250910-144732-ladsgroup.json [14:47:39] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:48:38] (03CR) 10Stevemunene: [C:03+2] Change all druid_public hosts references to use svc url [puppet] - 10https://gerrit.wikimedia.org/r/1185922 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [14:49:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83162 and previous config saved to /var/cache/conftool/dbconfig/20250910-144932-fceratto.json [14:49:37] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:49:40] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:50:08] !log jforrester@deploy1003 jforrester: Continuing with sync [14:51:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:53:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.288s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:54:05] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): Audit Moderator Tools applications for use of the "m." sub-domain - https://phabricator.wikimedia.org/T404207 (10Kgraessle) 03NEW [14:54:48] (03CR) 10Jforrester: [C:04-1] "Still investigating." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1186982 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:55:30] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186517|Improve performance of preferred labels subquery]] (duration: 13m 44s) [14:55:34] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=rest-gateway,name=codfw [reason: Depooling codfw ahead of switch to active-passive - T400131] [14:55:38] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [14:55:48] OK, all done here. [14:56:09] (03PS1) 10Ahmon Dancy: scap::master: Update advise in /srv/patches git pre-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) [14:56:57] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2212 gradually with 4 steps - pooling in [14:57:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2212 gradually with 4 steps - pooling in [14:57:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:58:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.67s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:59:27] (03PS1) 10Muehlenhoff: autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012 [15:02:29] (03PS2) 10Muehlenhoff: autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012 [15:03:43] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Stop pulling in udebs from unstable now that trixie is stable [puppet] - 10https://gerrit.wikimedia.org/r/1187012 (owner: 10Muehlenhoff) [15:05:19] (03CR) 10SBassett: [C:03+1] "Not sure if all secteam folks are sudoers on deployment though..." [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [15:06:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:08:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2204 and db2207 weights after flip', diff saved to https://phabricator.wikimedia.org/P83163 and previous config saved to /var/cache/conftool/dbconfig/20250910-150800-fceratto.json [15:08:27] (03CR) 10Ahmon Dancy: "ooh, good point. Can you run `sudo -l` on the deploy server and see if fix-staging-perms is listed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [15:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:50] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [15:10:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [15:11:44] (03CR) 10SBassett: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [15:12:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83164 and previous config saved to /var/cache/conftool/dbconfig/20250910-151216-ladsgroup.json [15:12:22] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [15:13:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11167999 (10elukey) @Jclark-ctr I synced with Jesse and he didn't make anything during yesterday's tests that could end up in this situation, so maybe a reset to factory defaults could help to s... 
[15:14:17] (03PS1) 10Xcollazo: dumps: disable rsync access for 2 dead dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) [15:16:31] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway,name=codfw [reason: Repooling codfw while investigating provisioning of proton service - T400131] [15:16:36] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [15:18:54] !log btullis@cumin1003 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [15:19:02] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [15:22:23] !log disable OSPF on mr1-eqsin to test BGP [15:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:34] !log pt1979@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqsin with reason: router upgrade [15:22:46] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11168027 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=eb9214cb-2708-4519-b3c6-38d4f0f7cf7d) set by pt1979@cumin1002 for 1:00:00 o... [15:23:57] (03PS1) 10Muehlenhoff: Create component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187019 (https://phabricator.wikimedia.org/T404114) [15:23:58] (03PS1) 10Muehlenhoff: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) [15:26:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:26:46] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [15:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:27:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:19] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P83165 and previous config saved to /var/cache/conftool/dbconfig/20250910-152725-ladsgroup.json [15:27:37] (03CR) 10Andrew Bogott: [C:03+2] Trove: install backdoor VM keys on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: 10Andrew Bogott) [15:27:48] (03PS1) 10Scott French: admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) [15:28:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:28:10] jouncebot: nowandnext [15:28:10] No deployments scheduled for the next 1 hour(s) and 31 minute(s) [15:28:10] In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1700) [15:28:17] 06SRE, 10bacula, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11168066 (10fgiunchedi) +SRE for visibility [15:28:34] borrowing mw-debug in codfw for a quick experiment, holler if anyone needs to deploy anything :) [15:29:16] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:29:25] (03CR) 10Jcrespo: [C:03+1] "Fine to go now" [puppet] - 10https://gerrit.wikimedia.org/r/1187019 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:30:08] rzl: mw-debug or mw-experimental? :) https://wikitech.wikimedia.org/wiki/Mw-experimental [15:30:29] (I don’t remember if you were involved in that so mentioning it just in case 😇) [15:30:39] (03PS1) 10Clare Ming: xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022 [15:30:45] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:31:23] Lucas_WMDE: debug! testing a helmfile change not a MW code change so not experimental's wheelhouse but I appreciate it anyway :) [15:31:28] (03CR) 10Jcrespo: "This is ok as it is, but, as I mentioned on the ticket, can we move the decision to the profile, as the logic will change in the future? 
I" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:31:30] ok :) [15:32:07] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022 (owner: 10Clare Ming) [15:33:13] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:33:49] (03Merged) 10jenkins-bot: xLab: Deploy v1.0.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187022 (owner: 10Clare Ming) [15:33:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [15:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:07] (03CR) 10Papaul: [C:03+2] Remove OSFP from mr1-eqsin and cr2/3-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1186682 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:34:11] (03CR) 10Jcrespo: "Actually, I have an additional question- to revert it, what actions are needed? will removing the repo setup be enough for a clean upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:34:21] (03CR) 10Muehlenhoff: "Not sure we even need to make ot configurable? 
It will be needed for all Trixie nodes and once the server side is updated it will be neede" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:35:09] btullis@cumin1003 roll-restart-masters (PID 1963407) is awaiting input [15:35:23] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:35:25] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:36:30] (03CR) 10Muehlenhoff: "Yes, once the server side is compatible, we:" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:36:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:38:15] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:38:57] (03CR) 10Hnowlan: [C:03+1] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [15:39:08] (03CR) 10RLazarus: [C:03+1] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [15:40:16] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[15:40:42] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:41:12] done with mw-debug for now, might go again before the 17:00 infra window, remains to be seen [15:41:57] (03PS1) 10Papaul: Fix typo on ospf3 [homer/public] - 10https://gerrit.wikimedia.org/r/1187024 (https://phabricator.wikimedia.org/T294845) [15:43:31] (03CR) 10Papaul: [C:03+2] Fix typo on ospf3 [homer/public] - 10https://gerrit.wikimedia.org/r/1187024 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:43:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Swap db2213 and 2223 weights', diff saved to https://phabricator.wikimedia.org/P83166 and previous config saved to /var/cache/conftool/dbconfig/20250910-154331-fceratto.json [15:43:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P83167 and previous config saved to /var/cache/conftool/dbconfig/20250910-154340-ladsgroup.json [15:44:36] (03CR) 10Jcrespo: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:44:38] (03CR) 10Scott French: "Thank you both for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [15:44:42] (03CR) 10Scott French: [C:03+2] admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [15:49:31] borrowing mw-debug codfw again after all, don't mind me [15:49:38] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:49:48] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:50:10] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:50:11] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:51:46] (03PS2) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:51:59] (03Merged) 10jenkins-bot: admin_ng: bump cpu quota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187021 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [15:53:26] (03CR) 10Jcrespo: "It is almost the same thing, but I want to protect against myself: 1) on server upgrade, 2) on cloud services, 3) on next os upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:53:34] (03CR) 10Elukey: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:54:30] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS trixie [15:54:47] (03PS3) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 
(https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:55:39] (03PS4) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:56:32] (03PS1) 10Clare Ming: xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225) [15:56:50] (03CR) 10Jcrespo: "The idea is, when testing 15 on the storage server, I will be able to upgrade one host at a time, hence the hiera key." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:57:13] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:57:48] (03Abandoned) 10Clare Ming: xLab: Deploy v1.0.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186616 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming) [15:58:02] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:58:39] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming) [15:58:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402763)', diff saved to https://phabricator.wikimedia.org/P83168 and previous config saved to /var/cache/conftool/dbconfig/20250910-155847-ladsgroup.json [15:58:52] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [15:59:02] (03CR) 10Jcrespo: "I wouldn't mind an extra pair of eyes, I am a bit distracted and making mistakes." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:59:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:59:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83169 and previous config saved to /var/cache/conftool/dbconfig/20250910-155911-ladsgroup.json [15:59:30] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:00:13] (03Merged) 10jenkins-bot: xLab: Deploy v1.0.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187027 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming) [16:00:35] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:01:47] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:02:07] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:03:05] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [16:03:25] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [16:04:09] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:04:32] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:05:04] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:05:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [16:05:51] (03PS1) 10Btullis: Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) [16:06:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03) [16:06:57] (03PS5) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:07:28] (03CR) 10CI reject: [V:04-1] bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:10:12] (03PS6) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:11:50] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=rest-gateway,name=codfw [reason: Depooling codfw ahead of switch to active-passive - T400131] [16:11:54] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [16:12:32] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [16:12:49] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:14:33] (03CR) 10DCausse: [C:03+1] cirrus: Reduce galleries weight in 
search on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: 10Ebernhardson) [16:15:02] (03CR) 10Jcrespo: [C:03+1] "This is ok to me, but I am not saying it has to be like this, feel free to critizise it further." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:16:30] (03CR) 10Jcrespo: [C:03+1] "This will allow me to do hieradata/hosts/people1005.yaml: profile::backup::client_version: 15 # individually for testing." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [16:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:17:17] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168370 (10Jhancock.wm) [16:18:08] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11168372 (10Papaul) 05Open→03Resolved mr1-eqsin and cr2/3-eqsin are now running BGP for the management network. Resolving this task. Thanks @ay... 
[16:18:14] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168374 (Jhancock.wm) [16:21:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:23:23] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [16:23:58] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168432 (Jhancock.wm) a:Papaul [16:24:20] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168433 (Jhancock.wm) a:Papaul [16:24:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83170 and previous config saved to /var/cache/conftool/dbconfig/20250910-162421-ladsgroup.json [16:24:26] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [16:24:39] SRE, Data-Engineering, Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11168439 (Ottomata) @JAllemandou can you weigh in here? @CDanis def possible to reuse the logic in Java, but it would probably be a la... 
[16:26:21] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [16:27:40] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [16:28:48] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [16:39:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P83171 and previous config saved to /var/cache/conftool/dbconfig/20250910-163929-ladsgroup.json [16:40:18] (03CR) 10Scott French: [C:03+2] shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [16:40:53] (03CR) 10Ahmon Dancy: "Thanks Scott. I looked at a sampling of other security team users and they're all in the `deployment` group which is what grants this priv" [puppet] - 10https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [16:42:00] (03Merged) 10jenkins-bot: shellbox-constraints: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186576 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [16:42:46] (03PS7) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [16:43:46] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:43:50] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:44:02] (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [16:44:21] (03PS1) 10CDanis: 
otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) [16:45:03] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:45:15] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:46:13] (03CR) 10RLazarus: [C:03+1] otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) (owner: 10CDanis) [16:46:13] (03PS8) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [16:46:43] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:46:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [16:46:56] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:47:35] (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [16:48:48] (03PS9) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [16:48:49] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:48:53] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:49:00] (03CR) 10CDanis: [C:03+2] otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 
(https://phabricator.wikimedia.org/T380211) (owner: 10CDanis) [16:50:28] (03CR) 10Bking: [C:03+1] Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - 10https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [16:50:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11168675 (10Jclark-ctr) @elukey I continued to get errors no root file system is defined when trying to boot from uefi [16:54:21] (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [16:54:24] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm [16:54:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P83173 and previous config saved to /var/cache/conftool/dbconfig/20250910-165436-ladsgroup.json [16:55:06] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:55:15] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:55:20] (03PS10) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [16:55:32] !log started single-replica PHP 8.3 pilot on shellbox-constraints - T403284 [16:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:36] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [16:56:52] (03CR) 10CI reject: [V:04-1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 
(https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [16:57:47] (03Merged) 10jenkins-bot: otelcol: fix service name munging post-Envoy upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187036 (https://phabricator.wikimedia.org/T380211) (owner: 10CDanis) [16:58:39] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:59:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11168738 (10bking) @BTullis @Jclark-ctr Per @elukey 's comment, I'd also like to express my preference for using UEFI-DC Ops and IF are working to support the SuperMicro platform exclusively o... [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1700) [17:00:25] (03PS11) 10Cathal Mooney: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) [17:00:26] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:04:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83174 and previous config saved to /var/cache/conftool/dbconfig/20250910-170944-ladsgroup.json [17:09:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:49] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:10:00] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2199.codfw.wmnet with reason: Maintenance [17:10:11] (PS2) Anzx: Lift IP cap for workshop at University of Pretoria on 29-30 September [mediawiki-config] - https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) [17:10:13] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:10:20] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:10:23] (CR) Dzahn: [C:+2] "thanks" [puppet] - https://gerrit.wikimedia.org/r/1186609 (owner: Dzahn) [17:11:26] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [17:12:45] (CR) Dzahn: [C:+2] scap::master: Update advise in /srv/patches git pre-commit hook [puppet] - https://gerrit.wikimedia.org/r/1187011 (https://phabricator.wikimedia.org/T401672) (owner: Ahmon Dancy) [17:13:30] (CR) Ayounsi: [C:+1] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [17:13:39] (CR) Btullis: [C:+2] Temporarily exlude the 4 new hadoop workers to facilitate vlan change [puppet] - https://gerrit.wikimedia.org/r/1187029 (https://phabricator.wikimedia.org/T398438) (owner: Btullis) [17:15:52] !log dropping user_autocreate_serial on sul wikis where empty (T397367) [17:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:56] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [17:19:22] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [17:23:01] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1233-1236].eqiad.wmnet [17:24:28] SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11168866 (KFrancis) Hi all, I am confirming the NDA is complete. Thanks! [17:26:42] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11168885 (KFrancis) Hi all, I am confirming the NDA is complete. Thanks! 
[17:28:09] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2206.codfw.wmnet with reason: Maintenance [17:28:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83175 and previous config saved to /var/cache/conftool/dbconfig/20250910-172817-ladsgroup.json [17:28:23] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:11] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db2185.codfw.wmnet [17:34:54] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168927 (10RLazarus) And by request from @CDanis, adding to this config update cycle: If `mesh.tracing.service_name` is set in values, pass it through in the `service_name` field of the `OpenTelemetryC... 
[17:35:20] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [17:36:22] SRE, envoy, serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168928 (RLazarus) [17:36:26] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11168929 (Papaul) Open→Resolved Done on the switch side [17:37:15] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11168933 (Papaul) Done on the switch side [17:37:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm [17:37:37] SRE, envoy, serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11168934 (CDanis) Oh, one other thing we might want to do, if mesh.tracing.service_name is unset, default it to `.Release.Namespace`. This is effectively what happens anyway in the big mess between En... 
[17:39:58] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2185.codfw.wmnet [17:40:47] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1256.eqiad.wmnet [17:41:05] btullis@cumin1003 decommission (PID 1983368) is awaiting input [17:41:07] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1256 - Upgrading db1256.eqiad.wmnet [17:41:34] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1256 - Upgrading db1256.eqiad.wmnet [17:41:40] (03PS1) 10CDanis: turnilo: re-add summed-up TTFB measure [puppet] - 10https://gerrit.wikimedia.org/r/1187048 [17:41:47] (03Abandoned) 10Bking: WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: 10Bking) [17:45:15] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm [17:45:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1233-1236].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [17:47:11] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1256.eqiad.wmnet [17:47:32] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1256* gradually with 4 steps - Work done [17:47:34] (03CR) 10Cathal Mooney: [C:03+2] Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:48:40] !log 
rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:48:58] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:02] btullis@cumin1003 decommission (PID 1983368) is awaiting input [17:49:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83178 and previous config saved to /var/cache/conftool/dbconfig/20250910-174903-ladsgroup.json [17:49:08] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:49:25] (03Merged) 10jenkins-bot: Redefine 'asns_mapping' to include additional bgp group metadata [homer/public] - 10https://gerrit.wikimedia.org/r/1186613 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:49:31] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:49:50] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:51:38] (03PS1) 10Jdlrobson: Enable search recommendation on Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) [17:52:24] (03CR) 10CI reject: [V:04-1] Enable search recommendation on Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) (owner: 10Jdlrobson) [17:52:56] (03PS1) 10Scott French: proton: persist increased replica count [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131) [17:53:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: an-worker[1233-1236].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [17:53:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1233-1236].eqiad.wmnet [17:53:58] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:54:06] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:56:19] (03CR) 10Scott French: [C:03+2] proton: persist increased replica count [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [17:57:58] (03Merged) 10jenkins-bot: proton: persist increased replica count [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187053 (https://phabricator.wikimedia.org/T400131) (owner: 10Scott French) [18:00:05] dduvall and dancy: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1800) [18:00:19] o/ [18:00:23] Loitering. 
[18:01:04] (03PS1) 10Dzahn: zuul::executor: systctl setting unprivileged_userns_clone needed [puppet] - 10https://gerrit.wikimedia.org/r/1187055 (https://phabricator.wikimedia.org/T403847) [18:01:59] dancy: o/ rolling in a sec [18:02:38] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379) [18:02:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:02:43] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [18:03:39] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187056 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:04:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P83180 and previous config saved to /var/cache/conftool/dbconfig/20250910-180411-ladsgroup.json [18:06:14] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:08:24] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [18:09:36] (03CR) 10Scott French: [C:03+2] rest-gateway: Introduce rest-gateway-ro [puppet] - 10https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [18:10:07] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Moving 4 servers to the analytics vlan - btullis@cumin1003" [18:10:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moving 4 servers to the analytics vlan - btullis@cumin1003" [18:10:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:58] 06SRE, 10bacula, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11169125 (10Dzahn) @jcrespo I tested a restore on people1005. Just selected 3 image files from my own home dir... [18:11:51] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1233 [18:13:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1233 [18:15:47] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.18 refs T396379 [18:15:52] T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379 [18:16:02] !log running puppet agent on A:dnsbox hosts - T400131 [18:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:06] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [18:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:16:32] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:16] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1234 [18:19:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P83182 and previous config saved to /var/cache/conftool/dbconfig/20250910-181918-ladsgroup.json [18:19:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1234 [18:19:50] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1235 [18:20:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1235 [18:20:13] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1236 [18:20:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1236 [18:21:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:25:17] (03PS2) 10Jdlrobson: Enable search recommendation on Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) [18:26:18] (03CR) 10CI reject: [V:04-1] Enable search recommendation on Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) (owner: 10Jdlrobson) [18:26:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm [18:27:53] (03PS3) 10Jdlrobson: Enable search recommendation on Wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) [18:28:11] !log elevated error rate during wmf.18 group1 promotion. all were `$aspect must use one of the XXX_USAGE constants` error occurring from wmf.17 (cc T404238) [18:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:16] T404238: InvalidArgumentException: $aspect must use one of the XXX_USAGE constants, "A" given! - https://phabricator.wikimedia.org/T404238 [18:28:52] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=eqiad [reason: Pooling eqiad on new -ro service - T400131] [18:28:56] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [18:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM thanks!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [18:33:10] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1256* gradually with 4 steps - Work done [18:33:31] (03PS1) 10Bartosz Dziewoński: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) [18:33:39] (03PS1) 10Bartosz Dziewoński: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) [18:33:46] (03CR) 10Scott French: [C:03+2] wmnet: Introduce rest-gateway-ro [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [18:33:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński) [18:34:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: 10Bartosz Dziewoński) [18:34:26] !log swfrench@dns1004 START - running authdns-update [18:34:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402763)', diff saved to https://phabricator.wikimedia.org/P83184 and previous config saved to /var/cache/conftool/dbconfig/20250910-183426-ladsgroup.json [18:34:32] T402763: Drop rc_new from recentchanges table in wmf production - 
https://phabricator.wikimedia.org/T402763 [18:34:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2210.codfw.wmnet with reason: Maintenance [18:34:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83185 and previous config saved to /var/cache/conftool/dbconfig/20250910-183449-ladsgroup.json [18:35:36] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11169332 (10vaughnwalters) [18:35:47] !log swfrench@dns1004 END - running authdns-update [18:35:52] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11169333 (10cmooney) >>! In T404146#11166358, @ayounsi wrote: > We already (and lazily) do : `min_value=0, max_value=48` which was to accommodate both SON... [18:36:28] (03PS1) 10Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) [18:36:56] (03PS1) 10Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) [18:37:04] (03PS2) 10Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) [18:37:06] (03CR) 10CI reject: [V:04-1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [18:37:10] (03PS2) 10Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) [18:37:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:37:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:38:40] (CR) Lucas Werkmeister: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:39:01] !log ran authdns-update to add rest-gateway-ro and point rest-gateway at it - T400131 [18:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:04] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [18:40:33] (PS3) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) [18:40:39] (PS3) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) [18:40:42] (PS4) Bartosz Dziewoński: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) [18:41:43] (CR) Bartosz Dziewoński: Revert^2 "Set 
$wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:42:54] (CR) Lucas Werkmeister: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:42:58] (CR) Lucas Werkmeister: [C:+1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:43:02] (CR) Lucas Werkmeister: [C:+1] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [18:43:34] !log temporarily disabling puppet agent on A:dnsbox hosts - T400131 [18:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:12] (CR) Scott French: [C:+2] rest-gateway: Switch rest-gateway to A/P [puppet] - https://gerrit.wikimedia.org/r/1183084 (https://phabricator.wikimedia.org/T400131) (owner: Clément Goubert) [18:44:40] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11169380 (Jclark-ctr) @bking If you could update the Partman for EFI booting — it was originally set up for Legacy. I had requested the change to EFI booting, but it was failing, possibly due... 
[18:50:57] !log running puppet agent on A:dnsbox hosts - T400131 [18:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:02] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [18:51:03] \m/ [18:51:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:56:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83186 and previous config saved to /var/cache/conftool/dbconfig/20250910-185611-ladsgroup.json [18:56:16] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:58:15] SRE, Infrastructure-Foundations, netops: Renumber link addressing to lsw1-e1-eqiad and lsw1-f1-eqiad - https://phabricator.wikimedia.org/T404248 (cmooney) NEW p:Triage→Low [19:01:56] (PS1) Andrew Bogott: eqiad cloudceph: move one osd and one mon node to version 'reef' [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249) [19:02:59] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott) [19:03:35] SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11169471 (Andrew) Open→Resolved Everything is now on Quincy + Reef. 
[19:06:34] (CR) Andrew Bogott: [C:+2] eqiad cloudceph: move one osd and one mon node to version 'reef' [puppet] - https://gerrit.wikimedia.org/r/1187070 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott) [19:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:10:49] (PS1) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) [19:11:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P83187 and previous config saved to /var/cache/conftool/dbconfig/20250910-191119-ladsgroup.json [19:11:55] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bookworm [19:12:23] (PS2) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) [19:13:42] (PS3) Dzahn: zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) [19:13:47] (CR) Dzahn: [C:+2] zuul: use variables to set path to zookeeper TLS certs in config [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [19:15:41] (PS1) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [19:16:52] (PS1) CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1187080 
(https://phabricator.wikimedia.org/T403695) [19:17:34] (PS2) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [19:20:04] https://spiderpig.wikimedia.org/jobs/541 <-- with the train promoted to group1, am I safe to deploy something right now? [19:21:07] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:22:31] (CR) Dzahn: admin: add mahmoud-abdelsattar to ldap_only_users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: CDobbins) [19:24:57] (CR) Dzahn: [C:+2] "fine on executor nodes but needs follow-up on main nodes" [puppet] - https://gerrit.wikimedia.org/r/1187076 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [19:25:07] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse dns zones added for ssw1-d1-eqiad links - cmooney@cumin1003" [19:25:08] (PS1) Cathal Mooney: Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588) [19:26:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P83188 and previous config saved to /var/cache/conftool/dbconfig/20250910-192626-ladsgroup.json [19:27:47] jouncebot: nowandnext [19:27:47] For the next 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T1800) [19:27:47] In 0 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000) [19:28:11] cmooney@cumin1003 netbox (PID 1994354) is awaiting input [19:28:42] !log cmooney@cumin1003 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse dns zones added for ssw1-d1-eqiad links - cmooney@cumin1003" [19:28:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:20] (PS1) Dzahn: zuul::main: need to set TLS cert path variables before using them [puppet] - https://gerrit.wikimedia.org/r/1187083 (https://phabricator.wikimedia.org/T401614) [19:30:45] (CR) Ssingh: [C:+1] Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588) (owner: Cathal Mooney) [19:31:07] (CR) Dzahn: [C:+2] zuul::main: need to set TLS cert path variables before using them [puppet] - https://gerrit.wikimedia.org/r/1187083 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn) [19:31:36] (CR) Cathal Mooney: [C:+2] Include statements for new netbox-generated snippet files [dns] - https://gerrit.wikimedia.org/r/1187081 (https://phabricator.wikimedia.org/T402588) (owner: Cathal Mooney) [19:32:07] !log cmooney@dns2005 START - running authdns-update [19:33:24] (CR) Bartosz Dziewoński: [C:+1] build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender) [19:33:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender) [19:33:34] !log cmooney@dns2005 END - running authdns-update [19:41:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402763)', diff saved to https://phabricator.wikimedia.org/P83189 and 
previous config saved to /var/cache/conftool/dbconfig/20250910-194134-ladsgroup.json [19:41:43] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:41:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2219.codfw.wmnet with reason: Maintenance [19:41:56] (PS1) Reedy: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252) [19:42:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83190 and previous config saved to /var/cache/conftool/dbconfig/20250910-194200-ladsgroup.json [19:42:03] (CR) Reedy: [C:+2] HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252) (owner: Reedy) [19:42:10] (PS1) Reedy: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252) [19:42:16] (CR) Reedy: [C:+2] HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252) (owner: Reedy) [19:42:20] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [19:44:15] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:44:15] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [19:44:32] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:48:56] SRE, Data-Engineering, Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11169688 
(JAllemandou) Building a JAR containing the `is_pageview` logic with as few dependencies as possible is easy. My little researc... [19:49:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [19:50:15] cmooney@cumin1003 netbox (PID 1996852) is awaiting input [19:51:25] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for new IPs for ssw1-d1-eqiad - cmooney@cumin1003" [19:52:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for new IPs for ssw1-d1-eqiad - cmooney@cumin1003" [19:52:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:53:53] (Merged) jenkins-bot: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187087 (https://phabricator.wikimedia.org/T404252) (owner: Reedy) [19:55:00] (PS4) NMW03: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) [19:55:39] jouncebot: next [19:55:39] In 0 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2000). [20:00:05] Nemoralis and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:13] o/ [20:00:25] hi [20:00:29] (CR) Scott French: [C:+1] "Thanks for highlighting the source of the policy change adding the port." [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [20:00:32] (Merged) jenkins-bot: HookHandler: Do a CentralID lookup directly [extensions/OATHAuth] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187088 (https://phabricator.wikimedia.org/T404252) (owner: Reedy) [20:01:04] Who is doing the deployment? Or have I put myself in the position that it's me? :P [20:01:25] not me of course ;P [20:01:37] heh. looks like it's you Reedy, thanks ;) [20:01:40] (CR) Reedy: [C:+2] ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński) [20:01:41] (CR) Reedy: [C:+2] ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński) [20:02:08] (CR) Reedy: [C:+2] build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender) [20:02:36] MatmaRex: for your session handling changes, do you need to test those and/or is there a reasonable chance of needing to revert? :P [20:02:53] I think I'll probably do the rest in one go, and then do those two after... 
[20:03:19] (Merged) jenkins-bot: build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) (owner: Umherirrender) [20:03:36] (CR) Reedy: [C:+2] Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: NMW03) [20:04:11] Reedy: the backports are not risky. the config changes depend on the backport and are a little bit more risky. i can test once the config changes are on mwdebug. [20:04:41] Do you want those config ones separately? Or do the lot in one go? [20:04:45] separately/after [20:05:37] (Merged) jenkins-bot: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: NMW03) [20:05:41] Reedy: if you can do the backports and session config at the same time, that would be fine. 
we should probably do Nemoralis's config and the codesniffer one separately first [20:06:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83192 and previous config saved to /var/cache/conftool/dbconfig/20250910-200612-ladsgroup.json [20:06:18] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:07:09] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]] [20:07:18] T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252 [20:07:18] T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428 [20:07:19] T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781 [20:08:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bookworm [20:10:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1016.eqiad.wmnet with OS bookworm [20:11:11] (CR) Scott French: [C:+1] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus) [20:12:33] (PS2) CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1187080 
(https://phabricator.wikimedia.org/T403695) [20:13:32] !log reedy@deploy1003 reedy, umherirrender, nmw03: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:39] Nemoralis: ^ Do you want to test? I'm fine if you don't want to [20:13:39] T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252 [20:13:39] T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428 [20:13:40] T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781 [20:14:16] Reedy: tested, LGTM [20:14:39] !log reedy@deploy1003 reedy, umherirrender, nmw03: Continuing with sync [20:15:02] (PS1) Cathal Mooney: Nokia: EBGP configuration base build [homer/public] - https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) [20:16:30] (CR) CI reject: [V:-1] Nokia: EBGP configuration base build [homer/public] - https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [20:16:34] (CR) RLazarus: [C:+2] "Thanks! Verified that none of these charts have been bumped at master in the meantime, so this is conflict-free." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [20:16:39] (CR) CDobbins: admin: add mahmoud-abdelsattar to ldap_only_users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: CDobbins) [20:17:15] (Merged) jenkins-bot: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1187062 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński) [20:17:50] (PS2) Cathal Mooney: Nokia: EBGP configuration base build [homer/public] - https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) [20:19:02] (Merged) jenkins-bot: ApiQueryTokens: Persist any new token, instead of depending on the type [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187063 (https://phabricator.wikimedia.org/T403519) (owner: Bartosz Dziewoński) [20:19:58] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187087|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1187088|HookHandler: Do a CentralID lookup directly (T404252)]], [[gerrit:1175222|Add rights to bypass spam blacklists for azwiki sysops and interface-admins (T400428)]], [[gerrit:1185156|build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (T403781)]] (duration: 12m 48s) [20:19:58] (CR) Reedy: [C:+2] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [20:19:59] (CR) Scott French: [C:+2] wmnet: Switch rest-gateway to metafo [dns] - https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [20:20:00] (CR) Reedy: [C:+2] Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [20:20:05] T404252: OATHAuth loading more from the database than needed - https://phabricator.wikimedia.org/T404252 [20:20:05] T400428: Addition of "sboverride" and "abusefilter-bypass-blocked-external-domains" rights for azwiki sysops, interface admins and bureaucrats - https://phabricator.wikimedia.org/T400428 [20:20:06] T403781: MediaWiki.NamingConventions.ValidGlobalName: Stop accepting a comma-separated string for values, deprecated upstream - https://phabricator.wikimedia.org/T403781 [20:20:55] (Merged) jenkins-bot: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187065 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [20:20:55] (PS2) Clément Goubert: wmnet: Switch rest-gateway to metafo [dns] - https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) [20:20:58] (Merged) jenkins-bot: Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187066 (https://phabricator.wikimedia.org/T362324) (owner: Bartosz Dziewoński) [20:20:59] (I’m just here so I can shout complaints if the session changes break my tools again ;P) [20:21:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P83193 and previous config saved to /var/cache/conftool/dbconfig/20250910-202119-ladsgroup.json [20:21:28] hi lucaswerkmeister, thanks :) [20:21:45] (CR) Scott French: "recheck" [dns] - https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [20:23:15] (CR) Cathal Mooney: Nokia: EBGP configuration base build (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1187092 
(https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [20:23:30] !log swfrench@dns1004 START - running authdns-update [20:24:06] dancy: About? [20:24:15] Hola [20:24:23] scap is doing something weird [20:24:28] and macos terminal just crashed again [20:24:45] !log swfrench@dns1004 END - running authdns-update [20:25:01] reedy@deploy1003:/srv/mediawiki-staging$ scap backport 1187062 1187063 1187066 1187065 [20:25:01] 20:23:37 Checking whether requested changes are in a branch deployed to production and their dependencies valid... [20:25:01] 20:23:41 Change '1187062' validated for backport [20:25:01] 20:23:44 Change '1187063' validated for backport [20:25:01] Change '1186595', project 'mediawiki/core', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.45.0-wmf.17', '1.45.0-wmf.18'] [20:25:02] Continue with backport? [y/N]: [20:25:08] Why is it trying to do 1186595? [20:25:14] !log ran authdns-update to convert rest-gateway to active/passive - T400131 [20:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:17] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [20:25:19] hmm [20:25:19] Depends-On? [20:25:26] It must be a Depends-On in one of the primary commits [20:25:29] probably because lucaswerkmeister made me add Depends-On ;) [20:25:40] Now we can't deploy because scap says no :P [20:25:48] You can answer yes [20:26:04] Do you know if there's a bug about this, or shall I file one? [20:26:11] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=eqiad [reason: Pooling codfw on new -ro service - T400131] [20:26:13] I believe there is one. 
I'll look it up [20:26:15] As all three commits with the change-id are merged, so it shouldn't really care [20:26:16] thanks [20:26:32] the bug would presumably be T388025 [20:26:32] T388025: scap complaining about dependency which is already merged - https://phabricator.wikimedia.org/T388025 [20:26:36] cc dancy [20:26:44] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=rest-gateway-ro,name=codfw [reason: Pooling codfw on new -ro service - T400131] [20:26:45] That's the one. [20:26:54] I had only seen the opposite of it, T397931, which is why I asked to add the Depends-On [20:26:55] T397931: scap not complaining about dependencies only partially deployed with the train - https://phabricator.wikimedia.org/T397931 [20:26:58] good to know it’s broken both ways [20:26:58] yeah, the changes to deploy are correct [20:27:00] (Merged) jenkins-bot: all charts: Update mesh.configuration 1.13.0 to 1.14.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus) [20:27:06] heh [20:27:24] dippy bird to press Y [20:27:35] well, good to know for sure that we shouldn't use Depends-On in operations/mediawiki-config [20:27:38] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]] [20:27:43] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. 
- https://phabricator.wikimedia.org/T403519 [20:27:44] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [20:29:17] (PS1) CDobbins: admin: add johannesrichterwmde to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) [20:32:07] !log reedy@deploy1003 reedy, matmarex: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:32:20] MatmaRex: lucaswerkmeister ^ have at it [20:33:47] oh, i just realized i can't make an oauth app use the test servers [20:34:06] so… i guess i'll test once it's live? i was using https://oauth-hello-world.toolforge.org/ [20:34:30] heh, want me to just ship it then? 
[20:35:10] normal logins and edits work fine [20:35:12] yeahhh [20:36:08] I mean, you could probably make it send the right X-Wikimedia-Debug headers somewhere in the innards of the code [20:36:11] but yeah just syncing sounds okay too [20:36:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P83194 and previous config saved to /var/cache/conftool/dbconfig/20250910-203626-ladsgroup.json [20:36:34] !log reedy@deploy1003 reedy, matmarex: Continuing with sync [20:36:35] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage [20:40:07] SRE-SLO, Charts, Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11169880 (Catrope) [20:40:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage [20:41:47] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187062|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187063|ApiQueryTokens: Persist any new token, instead of depending on the type (T403519)]], [[gerrit:1187066|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324)]], [[gerrit:1187065|Revert^2 "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324)]] (duration: 14m 09s) [20:41:53] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. 
- https://phabricator.wikimedia.org/T403519 [20:41:53] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [20:42:39] it works now :) [20:42:42] looks like edits are still working \o/ https://test.wikidata.org/w/index.php?title=Lexeme:L123&diff=prev&oldid=738333 [20:43:05] https://test.wikipedia.org/w/index.php?title=User_talk:Matma_Rex&diff=prev&oldid=673772 [20:43:11] thanks Reedy [20:51:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402763)', diff saved to https://phabricator.wikimedia.org/P83195 and previous config saved to /var/cache/conftool/dbconfig/20250910-205134-ladsgroup.json [20:51:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2236.codfw.wmnet with reason: Maintenance [20:51:40] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:51:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83196 and previous config saved to /var/cache/conftool/dbconfig/20250910-205146-ladsgroup.json [20:52:48] (03PS3) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (owner: 10Gergő Tisza) [20:52:56] (03CR) 10CI reject: [V:04-1] Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (owner: 10Gergő Tisza) [20:53:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [20:53:30] ^ expected due to news [20:54:03] (03PS4) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' on remaining wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [20:54:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:56:11] likely due to ne—frick [20:57:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1016.eqiad.wmnet with OS bookworm [20:58:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [20:59:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2100) [21:06:16] (03CR) 10RLazarus: [C:03+2] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [21:07:57] (03Merged) 10jenkins-bot: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186676 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [21:09:27] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/apertium: apply [21:09:36] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/apertium: apply [21:10:23] !log 
rzl@deploy1003 helmfile [codfw] START helmfile.d/services/apertium: apply [21:11:01] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/apertium: apply [21:11:15] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/apertium: apply [21:11:56] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [21:15:09] (03PS1) 10SBassett: Optionally encrypt OTP secret in the database [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915) [21:16:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915) (owner: 10SBassett) [21:16:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83197 and previous config saved to /var/cache/conftool/dbconfig/20250910-211659-ladsgroup.json [21:17:04] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [21:17:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [21:18:17] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [21:18:33] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [21:18:35] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [21:24:57] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11170002 (10RLazarus) [21:25:13] RECOVERY - mailman list info ssl expiry on 
lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:28:11] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [21:28:28] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [21:30:16] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [21:30:17] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [21:30:25] (03CR) 10Scott French: [C:03+1] "Ah, that's an interesting use case, and agreed that the aggregated TTFB is meaningful in that context. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1187048 (owner: 10CDanis) [21:32:06] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [21:32:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P83198 and previous config saved to /var/cache/conftool/dbconfig/20250910-213207-ladsgroup.json [21:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:35:27] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:39:03] (03PS1) 10Ebernhardson: cirrus: Start AB test of did-you-mean profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858) [21:45:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:52] (03PS1) 10Bking: opensearch-operator: point to correct image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246) [21:47:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P83199 and previous config saved to /var/cache/conftool/dbconfig/20250910-214714-ladsgroup.json [21:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:53:47] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11170105 (10Jdlrobson-WMF) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250910T2200) [22:01:18] (03PS1) 10Andrew Bogott: cloudcephosd1016.eqiad.wmnet: update network names for Boookworm [puppet] - 10https://gerrit.wikimedia.org/r/1187111 [22:01:54] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1016.eqiad.wmnet: update network names for Boookworm [puppet] - 10https://gerrit.wikimedia.org/r/1187111 (owner: 10Andrew Bogott) [22:02:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2236 (T402763)', diff saved to https://phabricator.wikimedia.org/P83200 and previous config saved to /var/cache/conftool/dbconfig/20250910-220222-ladsgroup.json [22:02:27] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [22:02:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2237.codfw.wmnet with reason: Maintenance [22:02:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83201 and previous config saved to /var/cache/conftool/dbconfig/20250910-220245-ladsgroup.json [22:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:16:32] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:42] (03CR) 10Dzahn: [C:03+1] admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: 10CDobbins) [22:26:42] (03CR) 10Dzahn: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1187020/6884/people1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [22:27:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83202 and previous config saved to /var/cache/conftool/dbconfig/20250910-222738-ladsgroup.json [22:27:43] T402763: Drop rc_new from recentchanges 
table in wmf production - https://phabricator.wikimedia.org/T402763 [22:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P83203 and previous config saved to /var/cache/conftool/dbconfig/20250910-224246-ladsgroup.json [22:51:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:55:48] (03PS1) 10RLazarus: envoyproxy: Remove lua_script param [puppet] - 10https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) [22:57:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P83204 and previous config saved to /var/cache/conftool/dbconfig/20250910-225753-ladsgroup.json [23:01:01] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6885/console" [puppet] - 10https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:01:43] (03CR) 10RLazarus: envoyproxy: Remove lua_script param [puppet] - 
10https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:08:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Reboot', diff saved to https://phabricator.wikimedia.org/P83205 and previous config saved to /var/cache/conftool/dbconfig/20250910-230823-ladsgroup.json [23:08:35] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1257.eqiad.wmnet [23:08:54] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1257 - Upgrading db1257.eqiad.wmnet [23:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:09:01] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1257 - Upgrading db1257.eqiad.wmnet [23:13:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402763)', diff saved to https://phabricator.wikimedia.org/P83206 and previous config saved to /var/cache/conftool/dbconfig/20250910-231301-ladsgroup.json [23:13:06] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [23:13:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2239.codfw.wmnet with reason: Maintenance [23:14:00] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1257.eqiad.wmnet [23:15:03] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1257* gradually with 4 steps - Work done [23:15:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:19] (03PS1) 
10Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1187131 (https://phabricator.wikimedia.org/T404274) [23:18:24] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187132 (https://phabricator.wikimedia.org/T404274) [23:25:51] !log sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.29.12-1_amd64.deb # T403663 [23:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:56] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [23:26:05] !log sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy # T403663 [23:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:12] !log sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy # T403663 [23:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:32] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking), 10Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275 (10Seddon) 03NEW [23:28:44] (03PS1) 10RLazarus: envoy: Update to v1.29.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) [23:28:48] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1193 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1187135 (https://phabricator.wikimedia.org/T404277) [23:28:53] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187136 (https://phabricator.wikimedia.org/T404277) [23:28:58] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): [QA Task] Verify Android app compatability with removal of m. 
subdomain on test wiki - https://phabricator.wikimedia.org/T404276 (10Seddon) 03NEW [23:31:05] (03CR) 10RLazarus: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [23:31:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:33:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T404277 [23:33:40] T404277: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277 [23:34:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1193 with weight 0 T404277', diff saved to https://phabricator.wikimedia.org/P83209 and previous config saved to /var/cache/conftool/dbconfig/20250910-233428-ladsgroup.json [23:37:59] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db1193 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1187135 (https://phabricator.wikimedia.org/T404277) (owner: 10Gerrit maintenance bot) [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1187141 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1187141 (owner: 10TrainBranchBot) [23:38:48] !log Starting s8 eqiad failover from db1209 to db1193 - T404277 [23:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:52] T404277: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277 [23:39:00] (03CR) 10Scott French: [C:03+1] envoy: Update to v1.29.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [23:39:03] !log 
ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T404277', diff saved to https://phabricator.wikimedia.org/P83210 and previous config saved to /var/cache/conftool/dbconfig/20250910-233902-ladsgroup.json [23:39:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2240.codfw.wmnet with reason: Maintenance [23:39:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T402763)', diff saved to https://phabricator.wikimedia.org/P83211 and previous config saved to /var/cache/conftool/dbconfig/20250910-233943-ladsgroup.json [23:39:50] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [23:40:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.011 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1193 to s8 primary and set section read-write T404277', diff saved to https://phabricator.wikimedia.org/P83212 and previous config saved to /var/cache/conftool/dbconfig/20250910-234049-ladsgroup.json [23:41:11] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy: Update to v1.29.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187134 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [23:42:57] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187136 (https://phabricator.wikimedia.org/T404277) (owner: 10Gerrit maintenance bot) [23:43:14] !log ladsgroup@dns1004 START - running authdns-update [23:44:24] !log ladsgroup@dns1004 END - running authdns-update [23:44:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1209 T404277', diff saved to https://phabricator.wikimedia.org/P83213 and previous config saved to /var/cache/conftool/dbconfig/20250910-234456-ladsgroup.json [23:45:00] T404277: Switchover s8 
master (db1209 -> db1193) - https://phabricator.wikimedia.org/T404277 [23:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:47:03] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1209.eqiad.wmnet [23:47:11] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1209 - Upgrading db1209.eqiad.wmnet [23:47:18] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1209 - Upgrading db1209.eqiad.wmnet [23:53:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1187141 (owner: 10TrainBranchBot) [23:55:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:58:00] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1209.eqiad.wmnet [23:58:53] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1209* gradually with 4 steps - Work done