[00:00:48] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1183700|Disable wmgUseMdotRouting on testwiki in prod (T401595)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:01:43] !log krinkle@deploy1003 krinkle: Continuing with sync
[00:06:12] (PS1) Andrea Denisse: alert: Add slack_bot_token Bug: T401730 [labs/private] - https://gerrit.wikimedia.org/r/1184613 (https://phabricator.wikimedia.org/T401730)
[00:06:58] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183700|Disable wmgUseMdotRouting on testwiki in prod (T401595)]] (duration: 09m 30s)
[00:07:02] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595
[00:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:08:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614
[00:08:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614 (owner: TrainBranchBot)
[00:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:14:55] (PS1) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730)
[00:14:55] (CR) Andrea Denisse: "Hi folks, I tested this in Pontoon by sending alerts to the #api-alerts and #api-alerts-test Slack channels." [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:16:39] (CR) Andrea Denisse: [C:+2] alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:16:49] (CR) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:17:13] (CR) Andrea Denisse: [V:+2 C:+2] alert: Add slack_bot_token Bug: T401730 [labs/private] - https://gerrit.wikimedia.org/r/1184613 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:23:09] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614 (owner: TrainBranchBot)
[00:26:13] (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182674 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:28:17] (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182674 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:29:30] (PS1) Jforrester: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656)
[00:29:40] (PS1) Jforrester: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656)
[00:36:43] (PS3) RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101)
[00:37:34] SRE, Traffic, MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11146274 (Krinkle)
[00:38:57] (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:40:00] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956)
[00:40:42] (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:40:50] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956) (owner: Jforrester)
[00:41:33] (PS3) RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101)
[00:42:39] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956) (owner: Jforrester)
[00:43:56] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[00:44:00] (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:44:11] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[00:44:28] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[00:45:01] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[00:45:06] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[00:45:39] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[00:46:05] (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:58:14] Heads up that I'm about to deploy two backports for an UBN.
[00:59:04] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[00:59:04] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:00:20] (Merged) jenkins-bot: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:00:40] (Merged) jenkins-bot: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:01:11] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]]
[01:01:15] T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:01:37] Kemayo you're my hero <3
[01:02:18] perryprog: it was a pretty bad one! Though also specific enough to trigger that I can see why we didn't catch it.
[01:02:53] it's been driving me nuts since I exclusively use the 2017 editor too. :D
[01:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:06:13] !log kemayo@deploy1003 jforrester, kemayo: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:06:16] T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:08:00] !log kemayo@deploy1003 jforrester, kemayo: Continuing with sync
[01:13:18] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]] (duration: 12m 07s)
[01:13:22] T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:14:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402925)', diff saved to https://phabricator.wikimedia.org/P82515 and previous config saved to /var/cache/conftool/dbconfig/20250904-011407-ladsgroup.json
[01:14:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:29:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P82516 and previous config saved to /var/cache/conftool/dbconfig/20250904-012914-ladsgroup.json
[01:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:44:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P82517 and previous config saved to /var/cache/conftool/dbconfig/20250904-014422-ladsgroup.json
[01:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:59:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402925)', diff saved to https://phabricator.wikimedia.org/P82518 and previous config saved to /var/cache/conftool/dbconfig/20250904-015929-ladsgroup.json
[01:59:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:59:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[01:59:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82519 and previous config saved to /var/cache/conftool/dbconfig/20250904-015952-ladsgroup.json
[02:29:09] (Abandoned) Samwilson: CommonSettings: Add CommunityRequests projects and group [mediawiki-config] - https://gerrit.wikimedia.org/r/1184375 (https://phabricator.wikimedia.org/T393860) (owner: Samwilson)
[02:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:02:23] (CR) Mmta: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[03:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:17:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:28:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:50] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:32:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82520 and previous config saved to /var/cache/conftool/dbconfig/20250904-043220-ladsgroup.json
[04:32:25] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:47:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P82521 and previous config saved to /var/cache/conftool/dbconfig/20250904-044728-ladsgroup.json
[04:53:56] RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[05:02:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P82522 and previous config saved to /var/cache/conftool/dbconfig/20250904-050235-ladsgroup.json
[05:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:17:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82523 and previous config saved to /var/cache/conftool/dbconfig/20250904-051743-ladsgroup.json
[05:17:47] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:17:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[05:18:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82524 and previous config saved to /var/cache/conftool/dbconfig/20250904-051806-ladsgroup.json
[05:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:33:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:55:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: Phuedx)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0600)
[06:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0600).
[06:23:41] jouncebot next
[06:23:41] In 0 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0700)
[06:24:17] I'll be 5-10 minutes late to the UTC morning backport window :)
[06:35:04] (CR) Arnaudb: [C:+2] "Done" [alerts] - https://gerrit.wikimedia.org/r/1184378 (owner: Arnaudb)
[06:36:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:37:00] (Merged) jenkins-bot: gitlab: alert on sidekiq queue piling up [alerts] - https://gerrit.wikimedia.org/r/1184378 (owner: Arnaudb)
[06:42:03] (CR) Muehlenhoff: [C:+1] "LGTM, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:43:03] ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11146541 (VRiley-WMF)
[06:43:50] ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11146542 (VRiley-WMF) I have noticed some of these cables are active and currently connected.
[06:45:21] (CR) Muehlenhoff: [C:+1] "This would work, alternative fix inline" [puppet] - https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:46:10] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:48:48] (PS1) Muehlenhoff: Remove LDAP access for astinson [puppet] - https://gerrit.wikimedia.org/r/1184641
[06:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:53] (CR) Muehlenhoff: [C:+2] Remove LDAP access for astinson [puppet] - https://gerrit.wikimedia.org/r/1184641 (owner: Muehlenhoff)
[06:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0700).
[07:00:05] abijeet and phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11146590 (elukey) >>! In T392851#11144994, @RobH wrote: > cp2045 has had the idrac, bios, and SSD firmware updated to latest revisions to match cp2043. > > P...
[07:01:08] abijeet will be late for the deployment. [07:01:21] phuedx: you can start with your patch. [07:02:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bookworm [07:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:13:34] kart_: o/ I'm back [07:13:41] kart_: Are you deploying your patch? [07:20:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: 10Phuedx) [07:20:56] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 5400 [07:21:23] (03Merged) 10jenkins-bot: MetricsPlatform: Enable overrides everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: 10Phuedx) [07:21:25] hello [07:21:52] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]] [07:21:55] T402369: Send analytics events for overridden experiments to the console - https://phabricator.wikimedia.org/T402369 [07:22:15] abijeet: hola [07:23:03] (03PS1) 10Abijeet Patro: Revert^2 "TranslationUnitDTO: Make blob type properties writable" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 [07:23:30] kart_, 1184645: Revert^2 "TranslationUnitDTO: Make blob type properties writable" | 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1184645 [07:23:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage [07:23:53] abijeet: thanks. I was about to ping for that. [07:24:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 (owner: 10Abijeet Patro) [07:24:18] kart_, I've scheduled it for deployment [07:26:18] thanks. let's wait for CI [07:26:30] and we can do that after phuedx's deployment is done. [07:26:43] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:26:51] Verifying [07:26:52] kart_, thanks [07:29:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage [07:30:01] (03CR) 10Filippo Giunchedi: interface: create rt_tables.d as needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [07:30:48] (03PS1) 10Slyngshede: P:puppetserver::volatile avoid loading Spur data on certain host [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) [07:31:17] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5400 [07:32:06] LGTM. The appropriate methods are available in the JS SDK. I tested overriding as a logged-in user on the Beta Cluster and on a production wiki. 
Continuing [07:32:10] !log phuedx@deploy1003 phuedx: Continuing with sync [07:33:24] (03PS2) 10Filippo Giunchedi: interface: create rt_tables.d as needed [puppet] - 10https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) [07:33:24] (03PS2) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) [07:33:25] (03CR) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [07:35:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6842/console" [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [07:37:02] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6843/console" [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [07:37:25] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]] (duration: 15m 33s) [07:37:29] T402369: Send analytics events for overridden experiments to the console - https://phabricator.wikimedia.org/T402369 [07:37:38] kart_, abijeet: Over to you :) [07:37:57] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28126 [07:38:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28126 [07:38:39] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262777 [07:38:58] Thanks phuedx [07:39:00] !log ayounsi@cumin1003 END (PASS) - 
Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262777 [07:39:05] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 41327 [07:39:45] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 41327 [07:39:49] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267536 [07:40:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267536 [07:40:17] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 269396 [07:40:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269396 [07:40:38] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268795 [07:40:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268795 [07:41:00] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 273363 [07:41:14] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 273363 [07:41:18] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 52968 [07:41:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52968 [07:41:44] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 266240 [07:42:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266240 [07:42:31] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 9002 [07:43:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 (owner: 
10Abijeet Patro) [07:43:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9002 [07:43:26] abijeet: started. [07:43:31] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 270735 [07:43:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270735 [07:43:49] CI ETA 11-12 minutes [07:43:56] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 199710 [07:43:56] kart_, ok [07:44:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199710 [07:44:40] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262412 [07:44:54] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262412 [07:45:00] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 139628 [07:45:24] (03Merged) 10jenkins-bot: Revert^2 "TranslationUnitDTO: Make blob type properties writable" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 (owner: 10Abijeet Patro) [07:45:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 139628 [07:45:40] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 265249 [07:45:49] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]] [07:46:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 265249 [07:46:22] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 264011 [07:46:25] (03CR) 10KartikMistry: "Please go ahead @jrobson@wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 
(owner: Jdlrobson) [07:47:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264011 [07:49:23] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 45014 [07:49:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45014 [07:49:40] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 52762 [07:49:54] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52762 [07:49:57] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 272207 [07:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bookworm [07:50:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 272207 [07:50:15] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262316 [07:50:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262316 [07:50:30] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28652 [07:50:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28652 [07:50:46] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 265966 [07:50:47] kart_, im here [07:51:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 265966 [07:51:05] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263908 [07:51:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263908 [07:51:29] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 
'email' for AS: 269548 [07:51:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269548 [07:51:44] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 266539 [07:51:51] (CR) Vgutierrez: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede) [07:51:54] abijeet_: on the change, will ping for testing. [07:51:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266539 [07:51:58] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267614 [07:52:00] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:52:02] kart_, ok [07:52:09] (PS1) Muehlenhoff: Add ganeti3005 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259) [07:52:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267614 [07:52:14] abijeet_: you can test now [07:52:15] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 53066 [07:52:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 53066 [07:52:46] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262662 [07:53:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262662 [07:53:09] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268188 [07:53:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for 
AS: 268188 [07:53:32] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 7063 [07:53:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7063 [07:53:47] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 273421 [07:53:56] kart_, ok [07:54:05] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 273421 [07:54:09] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 212635 [07:54:54] kart_, tested, works [07:55:16] (CR) Ayounsi: [C:+1] Add ganeti3005 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff) [07:55:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 212635 [07:55:36] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28604 [07:55:44] cool. syncing. 
[07:55:55] !log kartik@deploy1003 abi, kartik: Continuing with sync [07:55:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28604 [07:56:00] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263016 [07:56:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263016 [07:56:18] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263270 [07:56:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263270 [07:56:44] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 150178 [07:57:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 150178 [07:57:03] (CR) Muehlenhoff: [C:+2] Add ganeti3005 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff) [07:57:04] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 270364 [07:57:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270364 [07:57:20] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 40731 [07:57:29] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40731 [07:57:32] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268197 [07:57:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268197 [07:57:46] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267517 [07:57:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 
267517 [07:58:02] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 264927 [07:58:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264927 [07:58:21] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 7679 [07:59:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7679 [07:59:32] kart_, we can deploy [07:59:33] alright I'm done :) [07:59:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82525 and previous config saved to /var/cache/conftool/dbconfig/20250904-075945-ladsgroup.json [07:59:50] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [08:00:05] dancy and andre: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0800). [08:01:19] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]] (duration: 15m 30s) [08:01:27] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1184706 [08:02:58] kart_, all done? 
[08:04:09] (CR) Slyngshede: [V:+1 C:+2] P:puppetserver::volatile avoid loading Spur data on certain host [puppet] - https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede) [08:04:22] (CR) Filippo Giunchedi: [C:+2] interface: create rt_tables.d as needed [puppet] - https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:05:02] (CR) Muehlenhoff: wmcs: port ::instance to firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:05:44] (CR) Filippo Giunchedi: wmcs: port ::instance to firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:07:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:07:09] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:08:56] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [08:09:10] abijeet_: sorry. yes. All done. 
[08:10:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:14:27] (CR) Federico Ceratto: "See comment in https://phabricator.wikimedia.org/T403617#11143918" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [08:14:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P82526 and previous config saved to /var/cache/conftool/dbconfig/20250904-081453-ladsgroup.json [08:16:36] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 150178 [08:17:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 150178 [08:18:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet [08:20:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3005.esams.wmnet to cluster esams03 and group B [08:21:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:21:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3005.esams.wmnet to cluster esams03 and group B [08:22:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:28:58] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695 (mahmoud.abdelsattar.wmde) NEW [08:29:49] (CR) David Caro: [C:+1] "Can you run a pcc?" 
[puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:30:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P82527 and previous config saved to /var/cache/conftool/dbconfig/20250904-083001-ladsgroup.json [08:30:17] sre-alert-triage, Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11146759 (BTullis) Open→Resolved [08:31:23] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org [08:31:25] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:31:38] (CR) KCVelaga: [C:+1] "@kartik.mistry@gmail.com can you help with deployment for this one?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: KCVelaga) [08:32:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:33:04] (PS1) Muehlenhoff: Add prometheus3004 [puppet] - https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) [08:33:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:33:15] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11146776 (Peter) [08:35:36] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [08:35:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [08:35:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:35:41] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors [08:35:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors [08:36:09] ops-eqiad, SRE, DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11146785 (VRiley-WMF) |Device A|Device A Port|Device B|Device B Port|Type|Notes|Length required| |----------|-----------------|----------|----------|-------|-----|-------------... [08:36:14] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1003" [08:36:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1003" [08:36:19] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas3001.wikimedia.org [08:37:29] ops-eqiad, SRE, DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11146787 (VRiley-WMF) Was informed to update this with more information, but found out I already has updated this at 12:18PM [08:37:30] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11146786 (mahmoud.abdelsattar.wmde) Logstash 
access is denied when I try to access it using my SSO. [08:38:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:39:54] !log kill and restart imposm on maps-test2001 - stuck since August 10, lag building up and alerts [08:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:00] moritzm: --^ [08:42:43] (CR) Ayounsi: [C:+1] Add prometheus3004 [puppet] - https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: Muehlenhoff) [08:43:28] (CR) Tiziano Fogli: [C:+1] Add prometheus3004 [puppet] - https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: Muehlenhoff) [08:44:40] (CR) Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/1184511/6845/" [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:44:50] (CR) Filippo Giunchedi: [C:+2] wmcs: port ::instance to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [08:45:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82528 and previous config saved to /var/cache/conftool/dbconfig/20250904-084508-ladsgroup.json [08:45:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:45:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [08:47:44] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend 
(Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11146860 (Peter) [08:49:56] (CR) Tiziano Fogli: "Thanks, I'm going to fix it." [puppet] - https://gerrit.wikimedia.org/r/1184487 (owner: Tiziano Fogli) [08:50:12] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11146872 (Clement_Goubert) >>! In T400661#11143881, @Jhancock.wm wrote: > @jasmine_ how do you feel about the server going in row D? doesn't look like we have one in that row.... [08:50:42] (PS2) Tiziano Fogli: nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) [08:51:16] (CR) Btullis: dse-k8s: Introduce opensearch-operator namespace (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) (owner: Bking) [08:52:14] (CR) Muehlenhoff: [C:+2] Add prometheus3004 [puppet] - https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: Muehlenhoff) [08:53:17] (CR) CI reject: [V:-1] nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [08:53:32] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11146879 (ayounsi) [08:54:21] (PS3) Tiziano Fogli: nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) [08:56:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host prometheus3004.esams.wmnet [08:56:11] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:57:48] 
(CR) Tiziano Fogli: [C:+2] nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [08:59:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3004.esams.wmnet - jmm@cumin2002" [08:59:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3004.esams.wmnet - jmm@cumin2002" [08:59:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:00] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache prometheus3004.esams.wmnet on all recursors [09:00:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3004.esams.wmnet on all recursors [09:00:33] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3004.esams.wmnet - jmm@cumin2002" [09:00:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3004.esams.wmnet - jmm@cumin2002" [09:03:56] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:01] jmm@cumin2002 makevm (PID 1965233) is awaiting input [09:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:04:17] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11146922 (karapayneWMDE) Hello! Wikidata EM at WMDE here. Can confirm that Mahmoud is our new Staff Engineer [09:04:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus3004.esams.wmnet with OS bookworm [09:07:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:07:39] (CR) Ayounsi: [C:+2] esams: remove sandbox filter [homer/public] - https://gerrit.wikimedia.org/r/1184507 (owner: Ayounsi) [09:09:00] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11146942 (Vgutierrez) a:Vgutierrez→JMeybohm assigning the task to @JMeybohm, he is the SRE on clinic du... 
[09:09:29] (PS1) Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) [09:09:31] (Merged) jenkins-bot: esams: remove sandbox filter [homer/public] - https://gerrit.wikimedia.org/r/1184507 (owner: Ayounsi) [09:10:54] (PS1) Filippo Giunchedi: wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) [09:11:09] (PS1) Btullis: Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 [09:11:41] (PS1) Btullis: Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - https://gerrit.wikimedia.org/r/1184717 [09:12:00] (Abandoned) Cathal Mooney: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney) [09:12:21] (CR) David Caro: [C:+2] replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: David Caro) [09:13:56] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:48] (Abandoned) Muehlenhoff: profile::wmcs::instance: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971455 (owner: Muehlenhoff) [09:15:13] (CR) FNegri: [C:+1] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi) [09:15:52] (Abandoned) Muehlenhoff: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: 
Muehlenhoff) [09:15:55] (CR) David Caro: [C:+1] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi) [09:15:58] (CR) Filippo Giunchedi: [C:+2] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi) [09:17:22] (Abandoned) Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: Muehlenhoff) [09:18:51] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11146990 (JMeybohm) @KFrancis could you please verify/ensure the NDA has been signed? [09:23:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3004.esams.wmnet with reason: host reimage [09:24:16] (CR) Stevemunene: [C:+1] Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - https://gerrit.wikimedia.org/r/1184717 (owner: Btullis) [09:26:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:26:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:26:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82529 and previous config saved to /var/cache/conftool/dbconfig/20250904-092639-fceratto.json [09:26:43] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:28:26] (CR) Btullis: [C:+2] Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - 
https://gerrit.wikimedia.org/r/1184717 (owner: Btullis) [09:28:27] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 7679 [09:28:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82530 and previous config saved to /var/cache/conftool/dbconfig/20250904-092849-fceratto.json [09:28:58] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 7679 [09:29:07] PROBLEM - MariaDB Replica IO: analytics_meta on db1208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event . at 4, the last event read from analytics-meta-bin.000323 at 536559012, the last byte read from analytics-meta-bin.000323 at 536559043. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%2 [09:29:07] ng_a_replica [09:29:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3004.esams.wmnet with reason: host reimage [09:29:37] (PS1) Muehlenhoff: Remove the update definition for thirdparty/helm3 [puppet] - https://gerrit.wikimedia.org/r/1184719 [09:31:27] (CR) Stevemunene: [C:+1] Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis) [09:33:24] (CR) Btullis: [C:+2] Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis) [09:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:35:06] (Merged) jenkins-bot: Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis) [09:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:36:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [09:37:08] (PS1) Vgutierrez: haproxy: Send client TLS fingerprint to varnish [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) [09:37:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [09:37:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [09:37:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [09:37:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [09:38:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [09:38:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [09:40:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [09:41:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:41:06] RECOVERY - MariaDB Replica IO: 
analytics_meta on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:42:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [09:43:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82532 and previous config saved to /var/cache/conftool/dbconfig/20250904-094357-fceratto.json [09:44:04] (CR) Tiziano Fogli: "The /var/lib/prometheus directory is created with 0755 permissions by the prometheus-node-exporter Debian package, along with the promethe" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:46:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus3004.esams.wmnet with OS bookworm [09:46:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3004.esams.wmnet [09:48:31] (CR) FNegri: [C:+1] "The helm repo has moved recently:" [puppet] - https://gerrit.wikimedia.org/r/1184719 (owner: Muehlenhoff) [09:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:50:31] (PS1) Filippo Giunchedi: Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 [09:51:01] (CR) David Caro: [C:+1] Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 (owner: Filippo Giunchedi) [09:51:07] (CR) FNegri: [C:+1] Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 (owner: Filippo Giunchedi) [09:51:43] (CR) 
10Filippo Giunchedi: [V:03+2 C:03+2] Revert "wmcs: port ::instance to firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1184724 (owner: 10Filippo Giunchedi) [09:54:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3007.wikimedia.org [09:54:37] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:54:47] (03PS1) 10David Caro: helm: update the repo [puppet] - 10https://gerrit.wikimedia.org/r/1184726 [09:55:26] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11147104 (10Peter) [09:58:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002" [09:59:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82533 and previous config saved to /var/cache/conftool/dbconfig/20250904-095904-fceratto.json [09:59:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002" [09:59:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:59:40] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3007.wikimedia.org on all recursors [09:59:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast3007.wikimedia.org on all recursors [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1000) [10:00:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" 
[10:00:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" [10:01:16] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! As discussed we'll likely need or be able to improve on the purely static approach in time but for now this should work ok." [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:02:41] !log imported jenkins 2.516.2 for Bullseye/Bookworm T403703 [10:02:43] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:44] T403703: Upgrade Jenkins to 2.516.2 - https://phabricator.wikimedia.org/T403703 [10:03:21] jmm@cumin2002 makevm (PID 1995771) is awaiting input [10:06:47] (03CR) 10FNegri: [C:03+1] helm: update the repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:08:11] (03PS2) 10David Caro: helm: update the repo [puppet] - 10https://gerrit.wikimedia.org/r/1184726 [10:08:11] (03CR) 10David Caro: helm: update the repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:08:24] jmm@cumin2002 makevm (PID 1995771) is awaiting input [10:09:14] (03CR) 10David Caro: [C:04-1] "The key should be armored, not unarmored, updating" [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:10:24] (03PS3) 10David Caro: helm: update the repo [puppet] - 10https://gerrit.wikimedia.org/r/1184726 [10:14:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82534 and previous config saved to /var/cache/conftool/dbconfig/20250904-101412-fceratto.json [10:14:16] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log 
table - https://phabricator.wikimedia.org/T401906 [10:14:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:14:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast3007.wikimedia.org with OS bookworm [10:14:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82535 and previous config saved to /var/cache/conftool/dbconfig/20250904-101435-fceratto.json [10:14:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm [10:16:05] (03CR) 10FNegri: [C:03+1] helm: update the repo [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:16:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82536 and previous config saved to /var/cache/conftool/dbconfig/20250904-101645-fceratto.json [10:17:00] (03CR) 10Muehlenhoff: [C:03+1] "Didn't check the key, but the config change looks good. After merging you can validate by forcing a Puppet run on apt1002 and then running" [puppet] - 10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [10:17:35] (03Abandoned) 10Muehlenhoff: Remove the update definition for thirdparty/helm3 [puppet] - 10https://gerrit.wikimedia.org/r/1184719 (owner: 10Muehlenhoff) [10:17:58] (03CR) 10Ladsgroup: "If this is going to be on all wikis eventually, just put dblist: all instead. 
You don't need to update this every time you deploy to new w" [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [10:21:16] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir3003 [puppet] - 10https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [10:24:45] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3003.esams.wmnet [10:26:51] (03CR) 10Elukey: [C:03+1] "I like it, left a comment to add more info but the rest looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [10:28:28] (03PS1) 10Ayounsi: Revert "Remove esams RIPE Atlas measurements" [puppet] - 10https://gerrit.wikimedia.org/r/1184731 [10:28:45] (03PS1) 10Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1184732 [10:29:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:29:37] (03PS2) 10Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1184732 [10:29:56] (03PS3) 10Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1184732 [10:31:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82537 and previous config saved to /var/cache/conftool/dbconfig/20250904-103153-fceratto.json [10:34:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - 
jmm@cumin2002" [10:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir3003.esams.wmnet [10:36:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147233 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3003.esams.wmnet` - ncredir3003.esams.wmnet (**PASS**) - Downtimed host o... [10:38:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [10:38:54] (03PS1) 10Muehlenhoff: Drop esams01 cluster and reimage ganeti3007 [puppet] - 10https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259) [10:38:56] (03PS1) 10Muehlenhoff: Remove esams01 from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) [10:42:41] (03CR) 10Ladsgroup: "Generally speaking, it looks good to me. I think there have been some concerns over using seconds behind master since it's not as accurate" [alerts] - 10https://gerrit.wikimedia.org/r/1184039 (https://phabricator.wikimedia.org/T315866) (owner: 10Tiziano Fogli) [10:43:02] (03PS2) 10Muehlenhoff: Remove esams01 from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) [10:43:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [10:46:00] (03CR) 10Jcrespo: "So I am not saying what's the solution. There is probably other options, of which I suggested one. What worries me is to make certain thin" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:46:04] (03CR) 10Dreamy Jazz: "We don't intend to deploy to all wikis eventually AFAIK." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [10:47:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82538 and previous config saved to /var/cache/conftool/dbconfig/20250904-104700-fceratto.json [10:48:27] (03CR) 10Jcrespo: "> The /var/lib/prometheus directory is created with 0755 permissions by the prometheus-node-exporter Debian package" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:55:51] (03CR) 10Cathal Mooney: [C:03+1] Nokia: /routing-policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:58:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast3007.wikimedia.org with OS bookworm [10:58:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3007.wikimedia.org [10:58:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm completed: - bast300... 
[10:58:33] (03CR) 10Ayounsi: [C:03+2] Revert "Remove atlas3001 from monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1184732 (owner: 10Ayounsi) [10:58:39] (03CR) 10Ayounsi: [C:03+2] Revert "Remove esams RIPE Atlas measurements" [puppet] - 10https://gerrit.wikimedia.org/r/1184731 (owner: 10Ayounsi) [10:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:00:07] (03CR) 10Ayounsi: [C:03+1] Drop esams01 cluster and reimage ganeti3007 [puppet] - 10https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:00:21] (03CR) 10Ayounsi: [C:03+1] Remove esams01 from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:02:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82539 and previous config saved to /var/cache/conftool/dbconfig/20250904-110207-fceratto.json [11:02:11] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:02:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [11:02:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82540 and previous config saved to /var/cache/conftool/dbconfig/20250904-110230-fceratto.json [11:02:37] (03CR) 10Ayounsi: [C:03+2] Nokia: /routing-policy [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [11:03:56] (03Merged) 10jenkins-bot: Nokia: /routing-policy [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 
10Ayounsi) [11:04:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82541 and previous config saved to /var/cache/conftool/dbconfig/20250904-110440-fceratto.json [11:06:52] (03PS1) 10Muehlenhoff: Readd bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) [11:08:30] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11147326 (10LSobanski) Just a heads up that the alert fired again, can it be silenced for another month? [11:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:09:06] (03CR) 10Ladsgroup: [C:03+1] "Tests are always appreciated" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [11:09:41] (03PS2) 10Muehlenhoff: Readd bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) [11:15:09] (03CR) 10Ayounsi: "A few comments but overall lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:19:09] (03PS2) 10Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) [11:19:48] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9a6431c] (releasing): Update backup releases Jenkins [11:19:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82542 and previous config saved to /var/cache/conftool/dbconfig/20250904-111947-fceratto.json [11:21:23] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9a6431c] (releasing): Update backup releases Jenkins (duration: 02m 09s) [11:22:50] (03PS3) 10Muehlenhoff: Remove esams01 from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) [11:27:05] (03PS2) 10Jcrespo: dbbackups: Initial setup of dbprov1007, dbprov2007 [puppet] - 10https://gerrit.wikimedia.org/r/1182865 (https://phabricator.wikimedia.org/T403166) [11:27:57] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:28:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82543 and previous config saved to /var/cache/conftool/dbconfig/20250904-112804-ladsgroup.json [11:28:08] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:29:21] (03CR) 10Muehlenhoff: [C:03+2] Remove esams01 from Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:31:05] (03CR) 10David Caro: [C:03+2] helm: update the repo [puppet] - 
10https://gerrit.wikimedia.org/r/1184726 (owner: 10David Caro) [11:32:01] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing [11:32:05] (03CR) 10Jcrespo: [C:03+2] dbbackups: Initial setup of dbprov1007, dbprov2007 [puppet] - 10https://gerrit.wikimedia.org/r/1182865 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [11:32:29] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing (duration: 00m 38s) [11:34:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82544 and previous config saved to /var/cache/conftool/dbconfig/20250904-113455-fceratto.json [11:36:19] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing [11:36:45] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing (duration: 00m 26s) [11:39:13] (03CR) 10D3r1ck01: tests: Add test for wmfApplyEtcdDBConfig() (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [11:43:26] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update production releases Jenkins [11:44:01] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update production releases Jenkins (duration: 00m 36s) [11:44:07] (03PS3) 10Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) [11:45:34] FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:01] (03CR) 10Muehlenhoff: [C:03+2] 
Drop esams01 cluster and reimage ganeti3007 [puppet] - 10https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:48:02] (03CR) 10Ayounsi: [C:03+1] Readd bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:50:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82545 and previous config saved to /var/cache/conftool/dbconfig/20250904-115002-fceratto.json [11:50:06] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:50:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [11:50:20] (03CR) 10Ayounsi: [C:03+1] Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:50:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82546 and previous config saved to /var/cache/conftool/dbconfig/20250904-115025-fceratto.json [11:51:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82547 and previous config saved to /var/cache/conftool/dbconfig/20250904-115135-fceratto.json [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1200) [12:04:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bookworm [12:06:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to 
https://phabricator.wikimedia.org/P82548 and previous config saved to /var/cache/conftool/dbconfig/20250904-120646-fceratto.json [12:06:54] (03CR) 10Cathal Mooney: [C:03+2] Nokia: Save some vars to data{} dict so it only needs to be done once (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:08:38] (03Merged) 10jenkins-bot: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:14:01] !log Upgrade envoyproxy on vrts1003 T402584 [12:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:04] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [12:15:25] (03PS1) 10David Caro: helm: use the id of the subkey, not the parent key [puppet] - 10https://gerrit.wikimedia.org/r/1184754 [12:19:32] (03CR) 10David Caro: [C:03+2] helm: use the id of the subkey, not the parent key [puppet] - 10https://gerrit.wikimedia.org/r/1184754 (owner: 10David Caro) [12:21:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P82549 and previous config saved to /var/cache/conftool/dbconfig/20250904-122153-fceratto.json [12:22:50] (03PS1) 10Cathal Mooney: Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) [12:23:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:24:09] (03CR) 10CI reject: [V:04-1] 
Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:25:09] (03PS1) 10Arnaudb: gerrit: add alloy from grafana repo [puppet] - 10https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) [12:26:02] (03CR) 10Muehlenhoff: [C:03+2] Readd bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:26:10] (03PS2) 10Cathal Mooney: Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) [12:26:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage [12:27:55] (03PS1) 10Cory Massaro: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 [12:28:36] (03PS2) 10Cory Massaro: WIP: Increase max recursion depth in the orchestrator's composition language. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 [12:28:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:33:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage [12:35:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82550 and previous config saved to /var/cache/conftool/dbconfig/20250904-123701-fceratto.json [12:37:06] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:37:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [12:37:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82551 and previous config saved to /var/cache/conftool/dbconfig/20250904-123723-fceratto.json [12:38:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82552 and previous config saved to /var/cache/conftool/dbconfig/20250904-123833-fceratto.json [12:40:15] RESOLVED: [2x] 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:43:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:47:20] (03PS1) 10Jgreen: Add frmx2002.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1184771 (https://phabricator.wikimedia.org/T403673) [12:48:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:53:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82553 and previous config saved to /var/cache/conftool/dbconfig/20250904-125341-fceratto.json [12:54:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3007.esams.wmnet with OS bookworm [12:58:23] (03PS1) 10Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - 10https://gerrit.wikimedia.org/r/1184772 [13:00:05] Urbanecm and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:03:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:00] !log push pfw policies - T403717 [13:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:08:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82554 and previous config saved to /var/cache/conftool/dbconfig/20250904-130848-fceratto.json [13:10:46] (03PS1) 10Muehlenhoff: Add ganeti3007 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259) [13:12:15] (03PS1) 10Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184776 (https://phabricator.wikimedia.org/T395130) [13:13:34] !log upgrading CI Jenkins | T403703 [13:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] T403703: Upgrade Jenkins to 2.516.2 - https://phabricator.wikimedia.org/T403703 [13:13:56] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:49] (03CR) 10Krinkle: "I've removed the overrides via Horizon, logged at:" [puppet] - 10https://gerrit.wikimedia.org/r/1183275 (owner: 10Krinkle) [13:18:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:20:17] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1006 to an-worker1235 [13:20:26] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:21:00] (03Abandoned) 10Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184776 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [13:21:34] (03CR) 10Ayounsi: [C:03+1] Add ganeti3007 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:23:05] (03PS1) 10Tiziano Fogli: prometheus3004: assign prometheus::pop role, setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184778 (https://phabricator.wikimedia.org/T403620) [13:23:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:23:25] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11147809 (10Papaul) [13:23:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T401906)', diff saved 
to https://phabricator.wikimedia.org/P82555 and previous config saved to /var/cache/conftool/dbconfig/20250904-132356-fceratto.json [13:24:02] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:24:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [13:24:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82556 and previous config saved to /var/cache/conftool/dbconfig/20250904-132419-fceratto.json [13:24:28] (03Abandoned) 10Tiziano Fogli: prometheus3004: assign prometheus::pop role, setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184778 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:24:29] !log dcaro@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudcephosd1052.eqiad.wmnet with reason: swapping network card [13:25:06] 10ops-codfw, 06SRE, 06DC-Ops: codfw: document SCS ports in Netbox - https://phabricator.wikimedia.org/T403634#11147841 (10Papaul) p:05Triage→03Medium [13:25:21] (03PS1) 10Federico Ceratto: updates: correct Suite names for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) [13:25:22] (03CR) 10Federico Ceratto: "As discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [13:25:52] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:26:10] btullis@cumin1003 rename (PID 999795) is awaiting input [13:26:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82557 and previous config saved to /var/cache/conftool/dbconfig/20250904-132630-fceratto.json [13:28:15] 
RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:28:29] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti3007 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:29:34] (03CR) 10Jgreen: [C:03+2] Add frmx2002.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1184771 (https://phabricator.wikimedia.org/T403673) (owner: 10Jgreen) [13:29:43] (03PS1) 10Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184783 (https://phabricator.wikimedia.org/T403620) [13:29:44] (03PS1) 10Tiziano Fogli: prometheus3004: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) [13:29:46] (03PS1) 10Tiziano Fogli: prometheus::pop: enable rsyncd on esams [puppet] - 10https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620) [13:29:53] !log jgreen@dns1004 START - running authdns-update [13:30:04] !log mforns@deploy1003 Started deploy [analytics/refinery@a1f5011]: Fix for pageview actor automated reasons [analytics/refinery@a1f5011b] [13:30:53] !log jgreen@dns1004 END - running authdns-update [13:31:02] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11147880 (10Jhancock.wm) 05Open→03Resolved [13:31:25] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dnsdse-k8s-worker1014 - jclark@cumin1002" [13:31:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update 
dnsdse-k8s-worker1014 - jclark@cumin1002" [13:31:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11147887 (10Jhancock.wm) 05Open→03Resolved [13:32:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:32:57] !log mforns@deploy1003 Finished deploy [analytics/refinery@a1f5011]: Fix for pageview actor automated reasons [analytics/refinery@a1f5011b] (duration: 02m 52s) [13:33:28] !log mforns@deploy1003 Started deploy [analytics/refinery@a1f5011] (thin): Fix for pageview actor automated reasons THIN [analytics/refinery@a1f5011b] [13:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:25] !log mforns@deploy1003 Finished deploy [analytics/refinery@a1f5011] (thin): Fix for pageview actor automated reasons THIN [analytics/refinery@a1f5011b] (duration: 00m 57s) [13:34:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:35:21] (03PS1) 10Tiziano Fogli: prometheus3004: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) [13:35:38] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6846/co" [puppet] - 10https://gerrit.wikimedia.org/r/1184772 (owner: 10Slyngshede) [13:35:48] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:35:59] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:37:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:37:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147915 (10MoritzMuehlenhoff) [13:38:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147919 (10MoritzMuehlenhoff) [13:38:43] (03PS2) 10Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - 10https://gerrit.wikimedia.org/r/1184772 [13:38:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:45] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1235 on all recursors [13:38:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1235 on all recursors [13:38:49] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1235 [13:39:25] 06SRE, 10Ganeti, 
06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147920 (10ayounsi) [13:41:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82558 and previous config saved to /var/cache/conftool/dbconfig/20250904-134137-fceratto.json [13:41:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:41:51] btullis@cumin1003 rename (PID 999795) is awaiting input [13:42:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:42:50] (03CR) 10Tiziano Fogli: [C:03+2] prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184783 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:42:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1235 [13:42:56] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11147952 (10Jclark-ctr) a:03Jclark-ctr [13:43:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6847/console" [puppet] - 10https://gerrit.wikimedia.org/r/1184772 (owner: 10Slyngshede) [13:43:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11147954 (10Jclark-ctr) [13:43:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1006 to an-worker1235 [13:43:38] 
(03PS1) 10Tiziano Fogli: Revert "prometheus3004: setup firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/1184787 [13:45:18] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus3004: setup firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/1184787 (owner: 10Tiziano Fogli) [13:45:40] (03CR) 10Herron: "Shall we set this to 'b' temporarily, since prom300[34] may be running at the same time?" [puppet] - 10https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:46:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:46:48] (03PS2) 10Tiziano Fogli: prometheus3004: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) [13:46:48] (03PS2) 10Tiziano Fogli: prometheus::pop: enable rsyncd on esams [puppet] - 10https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620) [13:46:48] (03PS2) 10Tiziano Fogli: prometheus3004: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) [13:46:48] (03PS1) 10Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184789 (https://phabricator.wikimedia.org/T403620) [13:47:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:47:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:47:36] (03CR) 10MVernon: [C:03+1] updates: correct Suite names for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) 
[13:48:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:48:54] (03CR) 10Tiziano Fogli: "No, they won’t run concurrently. One will replace the other, with a small gap in between." [puppet] - 10https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147975 (10MoritzMuehlenhoff) [13:50:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove VIP for esams01 - jmm@cumin2002" [13:50:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove VIP for esams01 - jmm@cumin2002" [13:50:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:50:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:51:33] (03PS1) 10Filippo Giunchedi: firewall: add LINK_LOCAL sets [puppet] - 10https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) [13:51:34] (03PS1) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) [13:51:36] (03PS1) 10Filippo Giunchedi: bird: use LINK_LOCAL sets [puppet] - 10https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) [13:52:29] (03CR) 10Filippo Giunchedi: "Enables I16048a91 and clean up in I83fe8bede" [puppet] - 10https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [13:52:38] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1007 to an-worker1236 [13:52:58] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:54:00] (03CR) 10Tiziano Fogli: [C:03+2] prometheus3004: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184789 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:54:15] (03CR) 10Tiziano Fogli: [C:03+2] prometheus3004: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:54:28] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: enable rsyncd on esams [puppet] - 10https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:55:07] (03CR) 10Tiziano Fogli: [C:03+2] prometheus3004: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [13:56:28] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1007 to an-worker1236 - btullis@cumin1003" [13:56:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1007 to an-worker1236 - btullis@cumin1003" [13:56:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:45] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1236 on all recursors [13:56:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82560 and previous config saved to /var/cache/conftool/dbconfig/20250904-135645-fceratto.json [13:56:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1236 on all recursors [13:56:49] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1236 [13:57:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1236 [13:57:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1007 to an-worker1236 [13:58:12] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:58:24] (03CR) 10Federico Ceratto: [C:03+2] updates: correct Suite names for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [13:58:57] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:12] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:59:12] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/HTTP_proxy [14:00:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:01:02] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [14:01:54] (03PS1) 10Jgreen: nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) [14:01:57] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 270735 [14:02:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:02:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 270735 [14:02:35] (03CR) 10CI reject: [V:04-1] nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [14:03:22] (03CR) 10CDanis: haproxy: Send client TLS fingerprint to varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [14:03:31] (03PS3) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) [14:03:31] (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [14:03:31] (03PS1) 10Jforrester: Stop loading the Graph extension 
anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) [14:03:56] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:43] jclark@cumin1002 provision (PID 1450785) is awaiting input [14:06:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:07:30] 10SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729 (10elukey) 03NEW [14:08:05] 10SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729#11148078 (10elukey) [14:08:39] (03CR) 10Herron: [C:03+1] "Nice! 
🙌" [puppet] - 10https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: 10Andrea Denisse) [14:10:31] (03PS2) 10Vgutierrez: haproxy: Send client TLS fingerprint to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) [14:10:48] (03PS1) 10Scott French: shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) [14:10:50] (03CR) 10Vgutierrez: haproxy: Send client TLS fingerprint to varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [14:11:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:11:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:11:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [14:11:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82561 and previous config saved to /var/cache/conftool/dbconfig/20250904-141152-fceratto.json [14:11:56] T401906: Add default value for afl_ip and 
remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:12:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:12:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82562 and previous config saved to /var/cache/conftool/dbconfig/20250904-141215-fceratto.json [14:13:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [14:14:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82564 and previous config saved to /var/cache/conftool/dbconfig/20250904-141426-fceratto.json [14:14:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:14:47] (03PS1) 10Tiziano Fogli: prometheus/esams: remove 3003, add 3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) [14:16:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:16:35] (03CR) 10Kosta Harlan: dse-k8s-eqiad: Add ipoid-opensearch namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [14:16:47] (03CR) 10Tiziano Fogli: "To be submitted after the final rsync has completed and the services on 3003 have been stopped" [puppet] - 10https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [14:17:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally 
- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:18:18] (03CR) 10Herron: [C:03+1] prometheus/esams: remove 3003, add 3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [14:20:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [14:21:03] (03PS9) 10Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) [14:21:12] (03CR) 10Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: 10Bking) [14:21:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:22:34] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [14:23:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [14:24:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [14:25:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:25:54] (03CR) 10Ayounsi: 
[C:03+1] "very nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:25:54] !log upgrade Envoyproxy on webperf* T402584 [14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:58] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [14:26:00] (03CR) 10Bking: [C:03+2] dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: 10Bking) [14:27:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82565 and previous config saved to /var/cache/conftool/dbconfig/20250904-142701-ladsgroup.json [14:27:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:27:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:28:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams03 and group B [14:29:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82566 and previous config saved to /var/cache/conftool/dbconfig/20250904-142933-fceratto.json [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1430) [14:31:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3007.esams.wmnet to cluster esams03 and group B [14:32:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - 
https://phabricator.wikimedia.org/T402259#11148207 (10MoritzMuehlenhoff) [14:32:09] (03CR) 10Vgutierrez: [C:03+2] beta: Update hieradata for fe_vcl_config from Horizon [puppet] - 10https://gerrit.wikimedia.org/r/1183275 (owner: 10Krinkle) [14:34:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir3005.esams.wmnet to drbd [14:34:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:34:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11148226 (10ops-monitoring-bot) VM ncredir3005.esams.wmnet switching disk type to drbd [14:38:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11148246 (10MoritzMuehlenhoff) [14:39:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:39:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#11148265 (10MoritzMuehlenhoff) [14:41:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:42:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82567 and previous config saved to /var/cache/conftool/dbconfig/20250904-144208-ladsgroup.json [14:42:15] FIRING: 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:44:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir3005.esams.wmnet to drbd [14:44:13] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:29] RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.66 ms [14:44:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82569 and previous config saved to /var/cache/conftool/dbconfig/20250904-144441-fceratto.json [14:46:06] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mr1-ulsfo with reason: Bgp testing [14:46:14] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11148288 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=829d4d0b-c9d0-4961-b07b-d12e8f1ac430) set by pt1979@cumin2002 for 2:00:00 on 1 host(s) and their... 
[14:47:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:50:07] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/esams: remove 3003, add 3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [14:50:21] (03Abandoned) 10Bking: elastic: add test hieradata to help with LVS migration [puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [14:51:13] !log disable OSPF on mr1-ulsfo to test BGP [14:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] ^ ncredir3005 was depooled (part of the routed ganeti update) [14:51:34] !log upgrade Envoyproxy on Puppet servers T402584 [14:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:41] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [14:52:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:54:49] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:54:51] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:04] (03Abandoned) 10Bking: opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [14:55:29] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [14:55:38] (03CR) 10CDanis: [C:03+1] haproxy: Send client TLS fingerprint to 
varnish [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [14:55:41] (03CR) 10Filippo Giunchedi: [C:03+2] firewall: add LINK_LOCAL sets [puppet] - 10https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [14:55:51] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:19] (03PS1) 10Tiziano Fogli: prometheus/esams: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1184814 (https://phabricator.wikimedia.org/T403620) [14:56:27] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [14:56:47] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.64 ms [14:56:47] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.30 ms [14:57:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:57:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82570 and previous config saved to /var/cache/conftool/dbconfig/20250904-145716-ladsgroup.json [14:57:30] (03CR) 10Vgutierrez: [C:03+2] haproxy: Send client TLS fingerprint to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [14:57:52] 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738 (10bking) 03NEW [14:58:52] sigh I missed a ; in ferm, there will be failures [14:58:57] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:58:58] sending a fixup now [14:59:35] jhancock@cumin1002 provision (PID 1576767) is awaiting input [14:59:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:59:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82571 and previous config saved to /var/cache/conftool/dbconfig/20250904-145948-fceratto.json [14:59:52] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:59:53] (03PS1) 10Filippo Giunchedi: ferm: fixup LINK_LOCAL definition [puppet] - 10https://gerrit.wikimedia.org/r/1184815 [15:00:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [15:00:05] dancy and andre: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1500). 
[15:00:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T401906)', diff saved to https://phabricator.wikimedia.org/P82572 and previous config saved to /var/cache/conftool/dbconfig/20250904-150011-fceratto.json [15:00:15] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:00:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184815 (owner: 10Filippo Giunchedi) [15:00:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:00:48] (03CR) 10Filippo Giunchedi: [C:03+1] ferm: fixup LINK_LOCAL definition [puppet] - 10https://gerrit.wikimedia.org/r/1184815 (owner: 10Filippo Giunchedi) [15:00:56] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] ferm: fixup LINK_LOCAL definition [puppet] - 10https://gerrit.wikimedia.org/r/1184815 (owner: 10Filippo Giunchedi) [15:02:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:02:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:02:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T401906)', diff saved to https://phabricator.wikimedia.org/P82573 and previous config saved to /var/cache/conftool/dbconfig/20250904-150221-fceratto.json [15:02:31] (03PS1) 10Ayounsi: Add ulsfo private v4 range to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845) [15:03:57] FIRING: [2x] SLOMetricAbsent: wdqs-main-availability esams - 
https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:04:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:05:07] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/esams: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1184814 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [15:05:21] !log tappof@dns1004 START - running authdns-update [15:06:22] !log tappof@dns1004 END - running authdns-update [15:08:57] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:09:45] (03CR) 10Filippo Giunchedi: "I am no longer in o11y, adding o11y folks instead for deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [15:11:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11148443 (10Jhancock.wm) @elukey it still fails with just the BIOS update. moving on to idrac and ssd updates. 
[15:11:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [15:12:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:12:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82574 and previous config saved to /var/cache/conftool/dbconfig/20250904-151223-ladsgroup.json [15:12:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:12:28] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [15:12:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82575 and previous config saved to /var/cache/conftool/dbconfig/20250904-151235-ladsgroup.json [15:13:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [15:13:57] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:57] RESOLVED: [2x] SLOMetricAbsent: wdqs-main-availability esams - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:16:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:17:15] 06SRE, 10Observability-Metrics, 06Traffic: Port Traffic dashboards to Thanos - 
https://phabricator.wikimedia.org/T302266#11148493 (10Peachey88) 05Stalled→03Resolved p:05Unbreak!→03Medium a:03ssingh [15:17:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P82576 and previous config saved to /var/cache/conftool/dbconfig/20250904-151729-fceratto.json [15:17:34] !log installing apache2 security updates [15:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:47] 06SRE, 10Observability-Metrics: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954#11148499 (10Aklapper) 05Stalled→03Open p:05Unbreak!→03Medium [15:18:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:20:34] (03CR) 10Cathal Mooney: [C:03+1] "Good spot yep this is what is needed." [homer/public] - 10https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845) (owner: 10Ayounsi) [15:20:46] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:21:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:22:02] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:22:21] !log upgrade Envoyproxy on cloudweb servers T402584 [15:22:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:24] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [15:22:27] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:23:19] (03PS8) 10Scott French: P:mediawiki::php: add 8.3 and simplify versioning [puppet] - 10https://gerrit.wikimedia.org/r/1184101 [15:24:06] (03CR) 10Scott French: "Great! Thanks again for the review and testing, Timo." [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French) [15:25:00] !log migration from prometheus3003.esams to prometheus3004 has been completed T403620 [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:03] T403620: Migrate prometheus3003 to prometheus3004 - https://phabricator.wikimedia.org/T403620 [15:26:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:27:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:27:34] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:31:48] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED [15:32:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:32:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P82577 and previous config saved to /var/cache/conftool/dbconfig/20250904-153236-fceratto.json [15:33:56] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:45] RESOLVED: [5x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:37:14] jhancock@cumin1002 provision (PID 1638589) is awaiting input [15:39:34] btullis@cumin1003 reimage (PID 1012017) is awaiting input [15:44:24] btullis@cumin1003 reimage (PID 1012182) is awaiting input [15:45:49] FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:47:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T401906)', diff 
saved to https://phabricator.wikimedia.org/P82578 and previous config saved to /var/cache/conftool/dbconfig/20250904-154744-fceratto.json [15:47:49] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:48:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [15:48:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance [15:48:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T401906)', diff saved to https://phabricator.wikimedia.org/P82579 and previous config saved to /var/cache/conftool/dbconfig/20250904-154824-fceratto.json [15:49:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T401906)', diff saved to https://phabricator.wikimedia.org/P82580 and previous config saved to /var/cache/conftool/dbconfig/20250904-154934-fceratto.json [15:50:29] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:55:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:57:51] (03PS6) 10Federico Ceratto: mysqld_exporter.pp: reset /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [16:00:05] jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:05:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:08:57] (03PS1) 10Ahmon Dancy: buildkitd.toml.erb: Temporarily enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924) [16:11:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:12:03] (03PS2) 10Jgreen: nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) [16:12:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:14:17] (03CR) 10Ahmon Dancy: "Cherry-picked to gitlab-runners-puppetserver-01.gitlab-runners.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924) (owner: 10Ahmon Dancy) [16:16:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:16:18] (03CR) 10Effie Mouzeli: [C:03+1] P:mediawiki::php: add 8.3 and simplify versioning [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French) [16:17:00] FYI, the intermittent MediaWikiMemcachedHighErrorRate alerts seem to be due to T401425. I'll follow up there. [16:17:01] T401425: Investigate memcache errors during wikidata and commons dumps runs - https://phabricator.wikimedia.org/T401425 [16:18:53] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11148790 (10Mvolz) >>! In T345627#11138642, @elukey wrote: > @Mvolz Hi! Sorry for the dela... 
[16:19:31] jouncebot: nowandnext [16:19:31] For the next 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1600) [16:19:31] In 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700) [16:19:31] In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700) [16:21:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:22:13] FYI, in a few minutes, I'll start piloting a fraction of traffic for one Shellbox service (syntaxhighlight) on PHP 8.3 (T403284). this will be reverted after a couple of hours of testing. 
[16:22:13] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [16:22:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:24:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:24:32] (03CR) 10Volans: [C:03+1] "Looks sane to me, but the amount of corner cases that we're tracking is becoming quite worrisome for the long term maintainability of the " [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [16:25:49] (03PS1) 10Hashar: releases-jenkins: fix httpbb monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/1184870 (https://phabricator.wikimedia.org/T403703) [16:27:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:28:31] (03CR) 10Scott French: "Thanks for the review, Effie!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [16:28:32] (03CR) 10Dzahn: [C:03+2] releases-jenkins: fix httpbb monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/1184870 (https://phabricator.wikimedia.org/T403703) (owner: 10Hashar) [16:28:35] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [16:29:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:30:17] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [16:31:07] re: releases1003 "down" - it is not actually down, but the HTML content changed due to a version upgrade, and monitoring checks for a string (one it was hoping would never change, like "log in") [16:31:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:31:35] I'm going to roll out some intended-to-be-no-op envoy config changes to {api,rest}-gateway in staging and then eventually prod -- swfrench-wmf I'm happy to wait and go after you if you prefer, but I don't think there should actually be any conflict [16:31:46] needless to say.. it changed.. upstream managed to do that and changed "log in" to "Sign in" ...
[16:32:41] rzl: these should be disjoint enough in their effect that I'd say go ahead :) [16:32:48] 👍 [16:33:06] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:33:16] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:33:40] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:33:45] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:33:49] wowee look at us, operating this distributed system concurrently [16:35:04] !log upgrading and restarting envoyproxy on cephosd1001 for T402584 [16:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:07] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [16:35:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [16:35:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82581 and previous config saved to /var/cache/conftool/dbconfig/20250904-163517-fceratto.json [16:35:21] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:36:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:36:39] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11148901 (10RobH) >>! In T402536#11146541, @VRiley-WMF wrote: > I have noticed some of these cables are active and currently connected. Can you list them out specifically for double checking? Any of them th... 
[16:37:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82582 and previous config saved to /var/cache/conftool/dbconfig/20250904-163727-fceratto.json [16:37:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:38:29] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:39:19] !log started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in codfw - T403284 [16:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:22] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [16:39:50] !log upgrading and restarting envoyproxy on cephosd100[2-5] for T402584 [16:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:01] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:42:28] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:44:23] !log deployed chart 0.11.11 to api-gateway and rest-gateway staging, T403101 [16:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:26] T403101: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101 [16:44:49] !log upgrading and restarting envoyproxy on cephosd200[1-3] for T402584 [16:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:52] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [16:46:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:50:26] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876 [16:50:34] FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:52:06] (03PS1) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [16:52:09] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:52:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82583 and previous config saved to /var/cache/conftool/dbconfig/20250904-165235-fceratto.json [16:52:47] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:52:58] !log started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in eqiad - T403284 [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:02] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [16:54:42] (03PS2) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 
10https://gerrit.wikimedia.org/r/1184157 [16:55:00] (03PS2) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [16:55:34] RESOLVED: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [16:55:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [16:55:59] (03CR) 10BryanDavis: hcaptcha: Redirect / to mw.o project page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis) [16:56:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:59:29] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700) [17:00:15] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:00:50] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876 (owner: 10BryanDavis) [17:02:11] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:02:29] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:02:35] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876 (owner: 10BryanDavis) [17:03:25] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:02] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:04:08] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:04:15] (03PS3) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [17:04:20] !log rzl@deploy1003 helmfile 
[codfw] DONE helmfile.d/services/rest-gateway: apply [17:04:46] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:04:56] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:05:26] (03PS4) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [17:07:08] !log deployed chart 0.11.11 to api-gateway and rest-gateway prod, T403101 [17:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:11] T403101: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101 [17:07:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82584 and previous config saved to /var/cache/conftool/dbconfig/20250904-170743-fceratto.json [17:08:08] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:08:11] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11149158 (10RLazarus) [17:08:49] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:08:56] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:09:15] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:15:02] (03CR) 10A smart kitten: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [17:16:23] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11149191 (10Krinkle) I'm guessing the below has the same root cause, albeit on a deployment host, not a varnish host. ` krinkle@deployment-deploy04:~$ sudo tail -n1... 
[17:17:04] (03CR) 10Dzahn: "@denisse: has Grafana Alloy been discussed before in observability? I am kind of surprised it seems at the same time common but not used b" [puppet] - 10https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [17:20:00] (03CR) 10Vgutierrez: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [17:20:17] (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on esams" [puppet] - 10https://gerrit.wikimedia.org/r/1184883 [17:21:12] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on esams" [puppet] - 10https://gerrit.wikimedia.org/r/1184883 (owner: 10Tiziano Fogli) [17:22:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82585 and previous config saved to /var/cache/conftool/dbconfig/20250904-172250-fceratto.json [17:22:54] (03PS1) 10Tiziano Fogli: prometheus3003: remove firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184885 (https://phabricator.wikimedia.org/T403620) [17:22:55] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:23:34] (03CR) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [17:24:00] (03PS5) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [17:27:14] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1235.eqiad.wmnet with OS bullseye [17:27:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:28:04] (03CR) 10Tiziano Fogli: [C:03+2] prometheus3003: remove firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184885 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [17:28:06] (03PS4) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 [17:28:32] (03CR) 10Scott French: "Thank you both for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French) [17:28:35] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1236.eqiad.wmnet with OS bullseye [17:30:01] (03CR) 10Scott French: [C:03+2] P:mediawiki::php: add 8.3 and simplify versioning [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French) [17:32:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:57] (03PS1) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [17:35:06] FIRING: SwaggerProbeHasFailures: Not all 
openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:35:33] (03PS2) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [17:35:35] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [17:36:48] (03PS1) 10Vgutierrez: haproxy: Add an Allow header on 405 responses [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) [17:37:06] (03PS3) 10Aaron Schulz: Add restbase spec JSON files to which /rest_v1/?spec can be routed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) [17:37:10] (03PS4) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [17:37:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) (owner: 10Vgutierrez) [17:38:45] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:39:18] (03PS5) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [17:42:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate 
[17:42:51] (03PS1) 10Tiziano Fogli: prometheus/esams: remove 3003 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184888 (https://phabricator.wikimedia.org/T403620) [17:43:21] (03PS3) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [17:43:21] (03PS6) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [17:43:29] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [17:44:02] (03PS3) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 [17:44:03] (03PS2) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 [17:45:38] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/esams: remove 3003 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184888 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli) [17:46:26] !log [WDQS] T403738 Rolling restart of `envoyproxy.service` on `wdqs-main`, 2 hosts at a time [17:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:30] T403738: Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738 [17:46:40] (03CR) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [17:46:40] (03CR) 10Reedy: [C:03+1] Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [17:47:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:52:27] (03PS1) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) [17:52:39] !log tappof@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus3003.esams.wmnet [17:52:58] (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [17:54:11] (03PS2) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) [17:57:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:57:24] !log tappof@cumin1002 START - Cookbook sre.dns.netbox [17:58:14] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [17:59:49] 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11149376 (10RKemper) Current status: [x] WCQS [x] WDQS main [x] WDQS scholarly [] WDQS public [18:00:04] 
dancy and andre: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800). [18:00:13] o/ [18:00:41] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378) [18:00:44] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:00:56] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002" [18:01:26] (03CR) 10Scott French: "Thank you very much, Effie! One issue and one question." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [18:01:35] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002" [18:01:35] !log tappof@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:36] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus3003.esams.wmnet [18:01:43] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:02:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:03:12] (03CR) 
10Dzahn: "lgtm. should fix issue on beta deployment server. was not an issue in prod because .git already existed. which sounds like a manual init w" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:03:40] (03CR) 10Dzahn: [V:04-1] "unfortunately: https://puppet-compiler.wmflabs.org/output/1184891/6850/deploy1003.eqiad.wmnet/change.deploy1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:03:57] FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:04:28] (03CR) 10Dzahn: [V:04-1] "so... if already declared at (file: /srv/jenkins/puppet-compiler/6850/change/src/modules/scap/manifests/master.pp, line: 58) then why is i" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:05:49] (03PS4) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [18:06:29] 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11149401 (10RKemper) a:03RKemper [18:07:00] (03CR) 10BryanDavis: P:puppetserver::volatile avoid loading Spur data on certain host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [18:07:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:07:57] (03CR) 10Krinkle: "PS3 applied in beta cluster (no-op):" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:08:11] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:08:19] (03CR) 10Dzahn: [V:04-1] "we could use "ensure_resource('file', '/srv/patches/.git', {'ensure' => 'directory' }) which should not error out for duplicate declarati" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:08:26] (03PS5) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [18:08:31] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:08:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82586 and previous config saved to /var/cache/conftool/dbconfig/20250904-180855-ladsgroup.json [18:09:01] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:09:10] (03PS1) 10RLazarus: mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) [18:09:43] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.17 refs T396378 [18:09:46] T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378 [18:09:50] (03CR) 10Dzahn: [V:04-1] "Is profile::kubernetes::deployment_server::mediawiki::release not applied on a deployment server in beta?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:11:04] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149438 (10VRiley-WMF) Oh! I see what you mean. I'll take care of them. I was misreading it. Sorry about that! [18:12:07] (03PS3) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) [18:12:08] (03CR) 10Krinkle: "https://puppet-compiler.wmflabs.org/output/1184886/4889/cp1102.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:12:21] (03CR) 10Dzahn: "ah.. so .. the resource just needs to have a different title to not conflict with the same command for a different .git dir. use "command"" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:12:29] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149442 (10VRiley-WMF) [18:12:33] (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:12:40] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149443 (10VRiley-WMF) 05Open→03Resolved [18:13:39] (03PS4) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) [18:13:56] RESOLVED: [3x] JobUnavailable: Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:07] (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:14:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:14:39] (03PS5) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) [18:16:29] (03PS1) 10Scott French: mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) [18:16:33] (03PS6) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [18:16:33] (03PS7) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [18:17:16] (03PS3) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) [18:17:31] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:19:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:23:10] (03CR) 10RLazarus: [C:03+1] mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French) [18:23:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:24:02] (03CR) 10Ahmon Dancy: "Revised." [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:24:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82587 and previous config saved to /var/cache/conftool/dbconfig/20250904-182403-ladsgroup.json [18:24:27] (03CR) 10Scott French: [C:03+2] mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French) [18:25:00] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11149491 (10KFrancis) Hi All, there is not an NDA on file for Mahmoud Abdelsattar. @mahmoud.abdelsattar.wmde Ple... 
[18:25:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [18:25:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm executed with errors: - dse-k8s-worker10... [18:26:00] (03Merged) 10jenkins-bot: mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French) [18:26:55] (03CR) 10Ssingh: [C:03+1] "Looks good, verified from RFC the Allow: format." [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) (owner: 10Vgutierrez) [18:28:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:28:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11149517 (10Jhancock.wm) @elukey okay so what i did today in terms of firmware updates is: cp2044 BIOS, iDRAC, SSD cp2046 BIOS, iDRAC only cp2047 BIOS only... 
[18:28:54] (03CR) 10Scott French: mediawiki-dumps-legacy: Use in-pod mcrouter container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [18:31:43] (03PS3) 10Krinkle: tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 [18:31:47] (03CR) 10Krinkle: tests: Add test for wmfApplyEtcdDBConfig() (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [18:34:51] (03PS1) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) [18:35:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11149539 (10VRiley-WMF) es1057 - rack C3, U14 [18:35:46] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 40731 [18:36:10] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40731 [18:36:51] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 212635 [18:38:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 212635 [18:38:36] (03CR) 10Dzahn: [V:03+1 C:03+1] "lgtm! 
https://puppet-compiler.wmflabs.org/output/1184891/6851/deploy1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:38:40] (03CR) 10Dzahn: [V:03+1 C:03+2] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:39:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82588 and previous config saved to /var/cache/conftool/dbconfig/20250904-183911-ladsgroup.json [18:41:33] (03CR) 10Ahmon Dancy: "Thanks Dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:48:08] (03PS1) 10Dreamy Jazz: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) [18:48:16] (03CR) 10Dzahn: [V:03+1 C:03+2] "@Krinkle: this should have fixed the issue on beta deploy server. I confirmed it was noop on both prod servers." [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [18:49:03] (03CR) 10Ottomata: [C:03+1] "Thank you!!!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184706 (owner: 10PipelineBot) [18:49:10] (03CR) 10CI reject: [V:04-1] Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [18:49:57] (03PS4) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) [18:50:09] (03PS3) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) [18:53:37] (03PS2) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) [18:53:40] (03PS8) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [18:53:40] (03PS4) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) [18:54:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82590 and previous config saved to /var/cache/conftool/dbconfig/20250904-185418-ladsgroup.json [18:54:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:54:24] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [18:54:29] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:54:32] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Depooling db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82591 and previous config saved to /var/cache/conftool/dbconfig/20250904-185431-ladsgroup.json [18:54:34] (03PS2) 10Dreamy Jazz: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) [18:54:46] jouncebot: nowandnext [18:54:46] For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800) [18:54:46] In 1 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000) [18:55:30] (03CR) 10Dzahn: [V:04-1] ""Please ensure you declare profile::base::overlayfs: true in hiera."" [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:56:08] (03PS3) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) [18:56:20] jouncebot: help [18:56:20] **** JounceBot Help **** [18:56:20] JounceBot is a deployment helper bot for the Wikimedia movement. [18:56:20] Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot [18:56:20] Available commands: [18:56:20] HELP Print all commands known to the server. [18:56:20] NEXT Get the next deployment event(s if they happen at the same time). [18:56:21] NOW Get the current deployment event(s) or the time until the next. [18:56:21] NOWANDNEXT Get the current and next deployment event(s). [18:56:22] REFRESH Refresh my knowledge about deployments. [18:56:45] Anyone mind if I backport? 
[18:56:52] OK w/ me [18:56:56] Thanks [18:57:00] no concerns [18:57:20] My config change will be a no-op so shouldn't see any changes in any logs etc [18:57:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [18:57:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:58:12] (03Merged) 10jenkins-bot: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [18:58:32] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]] [18:58:33] (03CR) 10Dreamy Jazz: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [18:58:35] T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471 [18:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:59:12] (03CR) 10D3r1ck01: [C:03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle) [19:01:15] (03CR) 10Ssingh: varnish: factor out unified_mobile_domain_regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:02:04] (03CR) 10Dreamy Jazz: [C:03+1] Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [19:03:42] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:03:46] T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471 [19:05:44] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [19:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:10:27] (03CR) 10Jforrester: "(Needs to wait for 1.45.0-wmf.18 to be everywhere.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [19:11:08] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]] (duration: 12m 36s) [19:11:12] T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471 [19:13:21] I'm done with my deploys [19:13:57] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:27] (03CR) 10Dzahn: [V:03+1 C:03+1] "this also gets us overlayfs! -> https://puppet-compiler.wmflabs.org/output/1184900/6854/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:19:01] (03CR) 10Dzahn: [C:03+2] zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:20:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149725 (10Jclark-ctr) [19:21:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:34:27] (03CR) 10Krinkle: varnish: factor out unified_mobile_domain_regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:38:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149794 (10Jclark-ctr) @bking @BTullis Can you assist with preseed.yaml? It doesn’t appear to be configured for EFI booting on this server. EFI is required since it was ordered with NVMe Drives [19:41:34] jouncebot: nowandnext [19:41:34] For the next 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800) [19:41:34] In 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000) [19:41:49] Anyone mind if I deploy a security patch? 
[19:42:23] is it the one that fixes the Evil Exploit that I'm actively abusing where, uh, uh, I uhh.... [19:42:24] idk [19:42:26] something funny [19:44:41] Yes. The very evil exploit [19:44:48] ughhh. Okay fine, I'll let you patch it. [19:44:53] :D [19:53:03] (03PS3) 10Cory Massaro: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 [19:53:31] (03PS7) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [19:53:39] !log dreamyjazz Deployed security patch for T403757 [19:53:57] (03CR) 10CI reject: [V:04-1] varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:54:46] (03PS8) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) [19:55:57] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:57:24] I'm done with my security deploy in case anyone wants to use the late backport window [19:58:43] (03PS1) 10RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) [19:59:32] (03PS3) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) [20:00:06] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000). [20:00:06] No Gerrit patches in the queue for this window AFAICS. 
[20:04:46] (03CR) 10RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [20:04:48] (03CR) 10A smart kitten: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [20:05:24] (03CR) 10Krinkle: "PCC for prod:" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [20:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:06:27] (03PS9) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [20:07:51] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [20:10:31] (03CR) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [20:13:19] (03PS2) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) [20:14:31] (03CR) 10Andrea Denisse: "Hi Daniel, I can't recall any discussions regarding it at the time but it's a tool worth exploring so I'll add it to my team's agenda." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [20:14:56] (03CR) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [20:15:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:16:50] (03PS4) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) [20:16:50] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) [20:16:51] (03PS5) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) [20:16:58] (03CR) 10CI reject: [V:04-1] Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [20:17:01] (03CR) 10CI reject: [V:04-1] Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:17:11] Krinkle: I'm very excited for that; good luck! 
[20:18:12] thx [20:18:13] (03PS5) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) [20:18:13] (03PS3) 10Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) [20:31:53] (03PS2) 10RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) [20:44:53] (03PS1) 10Dzahn: zuul::executor: add parameter for port and set it to 7100 [puppet] - 10https://gerrit.wikimedia.org/r/1184924 (https://phabricator.wikimedia.org/T395938) [20:49:44] (03CR) 10Bking: [C:03+2] opensearch-operator: Add chart for review (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:51:18] (03PS7) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:52:38] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:53:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:57:00] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11150061 (10Krinkle) [20:58:10] RESOLVED: BFDdown: BFD session down between cr2-magru 
and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2100) [21:00:44] (03PS1) 10Bking: WIP: Introduce opensearch-operator to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) [21:03:48] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:11:27] Hey all - would like to get one quick sec patch out right now if I can. 
Let me know if I should hold off… [21:16:59] jhathaway@cumin1002 provision (PID 2164935) is awaiting input [21:17:29] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:18:07] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150196 (10bd808) [21:18:57] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150202 (10bd808) [21:19:05] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150204 (10bd808) [21:23:45] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:25:45] (03PS1) 10BryanDavis: beta: Remove replica instance from wmgMainStashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184937 (https://phabricator.wikimedia.org/T401227) [21:27:54] !log Deployed security fix for T403411 to 1.45.0-wmf.17 [21:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:03] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply new opensearch plugins pkg - bking@cumin1002 - T403749 [21:31:07] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [21:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:37:04] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply new opensearch plugins pkg - bking@cumin1002 - T403749 [21:37:07] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [21:38:20] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new opensearch plugins pkg - bking@cumin1002 - T403749 [21:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:52:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:59:10] FIRING: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:00:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82593 and previous config saved to /var/cache/conftool/dbconfig/20250904-220017-ladsgroup.json [22:00:22] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:04:10] RESOLVED: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:06:09] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new opensearch plugins pkg - bking@cumin1002 - T403749 [22:06:12] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [22:11:20] (03Abandoned) 10Ahmon Dancy: 
buildkitd.toml.erb: Temporarily enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924) (owner: 10Ahmon Dancy) [22:15:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82594 and previous config saved to /var/cache/conftool/dbconfig/20250904-221525-ladsgroup.json [22:21:48] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749 [22:21:51] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [22:30:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82595 and previous config saved to /var/cache/conftool/dbconfig/20250904-223032-ladsgroup.json [22:39:58] (03PS1) 10Scott French: P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184942 [22:40:00] (03PS3) 10Scott French: P:logstash::common: update filters for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184943 [22:41:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:41:47] (03CR) 10Scott French: "Thanks in advance for the review, Cole!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1184943 (owner: 10Scott French) [22:45:07] (03CR) 10Papaul: [C:03+2] Add ulsfo private v4 range to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845) (owner: 10Ayounsi) [22:45:36] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749 [22:45:40] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [22:45:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82596 and previous config saved to /var/cache/conftool/dbconfig/20250904-224540-ladsgroup.json [22:45:44] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:45:57] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [22:46:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2179 (T402925)', diff saved to https://phabricator.wikimedia.org/P82597 and previous config saved to /var/cache/conftool/dbconfig/20250904-224604-ladsgroup.json [22:46:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:47:55] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:48:55] RECOVERY - OSPF status on cr4-ulsfo is OK: 
OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:51:11] (03PS5) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) [22:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:51:46] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [22:53:50] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11150505 (10Papaul) mr1-ulsfo is now running BGP . All OSPF entries on mr1-ulsfo, cr3-ulsfo and cr4-ulsfo for the management network removed. 
[22:58:57] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:02:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:11:17] (03CR) 10Cwhite: [C:03+2] P:logstash::common: update filters for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184943 (owner: 10Scott French) [23:11:40] (03PS4) 10Scott French: P:logstash::common: update filters for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184943 [23:12:08] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749 [23:12:11] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749 [23:12:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:13:57] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:15] (03CR) 10Cwhite: [C:03+2] P:logstash::common: update filters for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184943 (owner: 10Scott French) [23:27:53] (03PS2) 10Scott French: P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1184942 [23:32:46] FIRING: Traffic bill over quota: Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:35:32] (03PS1) 10Scott French: shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284) [23:37:46] FIRING: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:38:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184951 [23:38:05] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184951 (owner: 10TrainBranchBot) [23:38:15] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [23:39:59] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [23:40:53] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [23:41:01] !log 
swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [23:41:24] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [23:41:36] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [23:41:48] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [23:41:53] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [23:42:31] !log finished single-replica PHP 8.3 pilot on shellbox-syntaxhighlight - T403284 [23:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:34] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [23:50:46] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184951 (owner: 10TrainBranchBot) [23:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:52:46] FIRING: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:57:46] RESOLVED: Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota