[00:00:48] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1183700|Disable wmgUseMdotRouting on testwiki in prod (T401595)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:01:43] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[00:06:12] <wikibugs>	 (PS1) Andrea Denisse: alert: Add slack_bot_token Bug: T401730 [labs/private] - https://gerrit.wikimedia.org/r/1184613 (https://phabricator.wikimedia.org/T401730)
[00:06:58] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183700|Disable wmgUseMdotRouting on testwiki in prod (T401595)]] (duration: 09m 30s)
[00:07:02] <stashbot>	 T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595
[00:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:08:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614
[00:08:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614 (owner: TrainBranchBot)
[00:12:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:14:55] <wikibugs>	 (PS1) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730)
[00:14:55] <wikibugs>	 (CR) Andrea Denisse: "Hi folks, I tested this in Pontoon by sending alerts to the #api-alerts and #api-alerts-test Slack channels." [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:16:39] <wikibugs>	 (CR) Andrea Denisse: [C:+2] alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:16:49] <wikibugs>	 (CR) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:17:13] <wikibugs>	 (CR) Andrea Denisse: [V:+2 C:+2] alert: Add slack_bot_token Bug: T401730 [labs/private] - https://gerrit.wikimedia.org/r/1184613 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[00:23:09] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614 (owner: TrainBranchBot)
[00:26:13] <wikibugs>	 (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182674 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:28:17] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182674 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:29:30] <wikibugs>	 (PS1) Jforrester: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656)
[00:29:40] <wikibugs>	 (PS1) Jforrester: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656)
[00:36:43] <wikibugs>	 (PS3) RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101)
[00:37:34] <wikibugs>	 SRE, Traffic, MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11146274 (Krinkle)
[00:38:57] <wikibugs>	 (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:40:00] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956)
[00:40:42] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:40:50] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956) (owner: Jforrester)
[00:41:33] <wikibugs>	 (PS3) RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101)
[00:42:39] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606 [deployment-charts] - https://gerrit.wikimedia.org/r/1184620 (https://phabricator.wikimedia.org/T397956) (owner: Jforrester)
[00:43:56] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[00:44:00] <wikibugs>	 (CR) RLazarus: [C:+2] api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:44:11] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[00:44:28] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[00:45:01] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[00:45:06] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[00:45:39] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[00:46:05] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101) (owner: RLazarus)
[00:58:14] <Kemayo>	 Heads up that I'm about to deploy two backports for a UBN.
[00:59:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[00:59:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:00:20] <wikibugs>	 (Merged) jenkins-bot: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184618 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:00:40] <wikibugs>	 (Merged) jenkins-bot: EditAttemptStep: don't error if something is blocking session logging [extensions/WikimediaEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184619 (https://phabricator.wikimedia.org/T403656) (owner: Jforrester)
[01:01:11] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]]
[01:01:15] <stashbot>	 T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:01:37] <perryprog>	 Kemayo you're my hero <3
[01:02:18] <Kemayo>	 perryprog: it was a pretty bad one! Though also specific enough to trigger that I can see why we didn't catch it.
[01:02:53] <perryprog>	 it's been driving me nuts since I exclusively use the 2017 editor too. :D
[01:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:06:13] <logmsgbot>	 !log kemayo@deploy1003 jforrester, kemayo: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:06:16] <stashbot>	 T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:08:00] <logmsgbot>	 !log kemayo@deploy1003 jforrester, kemayo: Continuing with sync
[01:13:18] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184618|EditAttemptStep: don't error if something is blocking session logging (T403656)]], [[gerrit:1184619|EditAttemptStep: don't error if something is blocking session logging (T403656)]] (duration: 12m 07s)
[01:13:22] <stashbot>	 T403656: "Invalid response from server" rarely appearing when attempting to save edits - https://phabricator.wikimedia.org/T403656
[01:14:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402925)', diff saved to https://phabricator.wikimedia.org/P82515 and previous config saved to /var/cache/conftool/dbconfig/20250904-011407-ladsgroup.json
[01:14:11] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:29:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P82516 and previous config saved to /var/cache/conftool/dbconfig/20250904-012914-ladsgroup.json
[01:33:56] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:44:23] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P82517 and previous config saved to /var/cache/conftool/dbconfig/20250904-014422-ladsgroup.json
[01:48:56] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:59:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T402925)', diff saved to https://phabricator.wikimedia.org/P82518 and previous config saved to /var/cache/conftool/dbconfig/20250904-015929-ladsgroup.json
[01:59:34] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:59:45] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[01:59:53] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82519 and previous config saved to /var/cache/conftool/dbconfig/20250904-015952-ladsgroup.json
[02:29:09] <wikibugs>	 (Abandoned) Samwilson: CommonSettings: Add CommunityRequests projects and group [mediawiki-config] - https://gerrit.wikimedia.org/r/1184375 (https://phabricator.wikimedia.org/T393860) (owner: Samwilson)
[02:33:56] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:48:56] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:02:23] <wikibugs>	 (CR) Mmta: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[03:03:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:56] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:17:08] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:28:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:32:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82520 and previous config saved to /var/cache/conftool/dbconfig/20250904-043220-ladsgroup.json
[04:32:25] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:47:29] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P82521 and previous config saved to /var/cache/conftool/dbconfig/20250904-044728-ladsgroup.json
[04:53:56] <jinxer-wm>	 RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[05:02:36] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P82522 and previous config saved to /var/cache/conftool/dbconfig/20250904-050235-ladsgroup.json
[05:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:08:56] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:17:44] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T402925)', diff saved to https://phabricator.wikimedia.org/P82523 and previous config saved to /var/cache/conftool/dbconfig/20250904-051743-ladsgroup.json
[05:17:47] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:17:59] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[05:18:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82524 and previous config saved to /var/cache/conftool/dbconfig/20250904-051806-ladsgroup.json
[05:33:56] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:33:56] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:48:56] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:55:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: Phuedx)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0600).
[06:23:41] <phuedx>	 jouncebot next
[06:23:41] <jouncebot>	 In 0 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0700)
[06:24:17] <phuedx>	 I'll be 5-10 minutes late to the UTC morning backport window :)
[06:35:04] <wikibugs>	 (CR) Arnaudb: [C:+2] "Done" [alerts] - https://gerrit.wikimedia.org/r/1184378 (owner: Arnaudb)
[06:36:27] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:37:00] <wikibugs>	 (Merged) jenkins-bot: gitlab: alert on sidekiq queue piling up [alerts] - https://gerrit.wikimedia.org/r/1184378 (owner: Arnaudb)
[06:42:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:43:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11146541 (VRiley-WMF)
[06:43:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11146542 (VRiley-WMF) I have noticed some of these cables are active and currently connected.
[06:45:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "This would work, alternative fix inline" [puppet] - https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:46:10] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:48:48] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for astinson [puppet] - https://gerrit.wikimedia.org/r/1184641
[06:48:56] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove LDAP access for astinson [puppet] - https://gerrit.wikimedia.org/r/1184641 (owner: Muehlenhoff)
[06:58:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0700).
[07:00:05] <jouncebot>	 abijeet and phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11146590 (elukey) >>! In T392851#11144994, @RobH wrote: > cp2045 has had the idrac, bios, and SSD firmware updated to latest revisions to match cp2043. >  > P...
[07:01:08] <kart_>	 abijeet will be late for the deployment.
[07:01:21] <kart_>	 phuedx: you can start with your patch.
[07:02:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bookworm
[07:03:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:56] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:13:34] <phuedx>	 kart_: o/ I'm back
[07:13:41] <phuedx>	 kart_: Are you deploying your patch?
[07:20:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: Phuedx)
[07:20:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 5400
[07:21:23] <wikibugs>	 (Merged) jenkins-bot: MetricsPlatform: Enable overrides everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) (owner: Phuedx)
[07:21:25] <abijeet>	 hello
[07:21:52] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]]
[07:21:55] <stashbot>	 T402369: Send analytics events for overridden experiments to the console - https://phabricator.wikimedia.org/T402369
[07:22:15] <kart_>	 abijeet: hola
[07:23:03] <wikibugs>	 (PS1) Abijeet Patro: Revert^2 "TranslationUnitDTO: Make blob type properties writable" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184645
[07:23:30] <abijeet>	 kart_, 1184645: Revert^2 "TranslationUnitDTO: Make blob type properties writable" | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1184645
[07:23:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:23:53] <kart_>	 abijeet: thanks. I was about to ping for that.
[07:24:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184645 (owner: Abijeet Patro)
[07:24:18] <abijeet>	 kart_, I've scheduled it for deployment
[07:26:18] <kart_>	 thanks. let's wait for CI
[07:26:30] <kart_>	 and we can do that after phuedx's deployment is done.
[07:26:43] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:26:51] <phuedx>	 Verifying
[07:26:52] <abijeet>	 kart_, thanks
[07:29:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:30:01] <wikibugs>	 (CR) Filippo Giunchedi: interface: create rt_tables.d as needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:30:48] <wikibugs>	 (PS1) Slyngshede: P:puppetserver::volatile avoid loading Spur data on certain host [puppet] - https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616)
[07:31:17] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5400
[07:32:06] <phuedx>	 LGTM. The appropriate methods are available in the JS SDK. I tested overriding as a logged-in user on the Beta Cluster and on a production wiki. Continuing
[07:32:10] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[07:33:24] <wikibugs>	 (PS2) Filippo Giunchedi: interface: create rt_tables.d as needed [puppet] - https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899)
[07:33:24] <wikibugs>	 (PS2) Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899)
[07:33:25] <wikibugs>	 (CR) Filippo Giunchedi: wmcs: port ::instance to firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:35:00] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6842/console" [puppet] - https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede)
[07:37:02] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6843/console" [puppet] - https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede)
[07:37:25] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184549|MetricsPlatform: Enable overrides everywhere (T402369)]] (duration: 15m 33s)
[07:37:29] <stashbot>	 T402369: Send analytics events for overridden experiments to the console - https://phabricator.wikimedia.org/T402369
[07:37:38] <phuedx>	 kart_, abijeet: Over to you :)
[07:37:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28126
[07:38:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28126
[07:38:39] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262777
[07:38:58] <kart_>	 Thanks phuedx 
[07:39:00] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262777
[07:39:05] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 41327
[07:39:45] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 41327
[07:39:49] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267536
[07:40:12] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267536
[07:40:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 269396
[07:40:35] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269396
[07:40:38] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268795
[07:40:50] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268795
[07:41:00] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 273363
[07:41:14] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 273363
[07:41:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 52968
[07:41:40] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52968
[07:41:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 266240
[07:42:26] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266240
[07:42:31] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 9002
[07:43:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 (owner: 10Abijeet Patro)
[07:43:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9002
[07:43:26] <kart_>	 abijeet: started.
[07:43:31] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 270735
[07:43:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270735
[07:43:49] <kart_>	 CI ETA 11-12 minutes
[07:43:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 199710
[07:43:56] <abijeet>	 kart_, ok
[07:44:28] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199710
[07:44:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262412
[07:44:54] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262412
[07:45:00] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 139628
[07:45:24] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "TranslationUnitDTO: Make blob type properties writable" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184645 (owner: 10Abijeet Patro)
[07:45:34] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 139628
[07:45:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 265249
[07:45:49] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]]
[07:46:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 265249
[07:46:22] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 264011
[07:46:25] <wikibugs>	 (03CR) 10KartikMistry: "Please go ahead @jrobson@wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson)
[07:47:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264011
[07:49:23] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 45014
[07:49:35] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45014
[07:49:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 52762
[07:49:54] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52762
[07:49:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 272207
[07:50:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bookworm
[07:50:11] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 272207
[07:50:15] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262316
[07:50:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262316
[07:50:30] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28652
[07:50:42] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28652
[07:50:46] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 265966
[07:50:47] <abijeet_>	 kart_, im here
[07:51:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 265966
[07:51:05] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263908
[07:51:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263908
[07:51:29] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 269548
[07:51:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269548
[07:51:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 266539
[07:51:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede)
[07:51:54] <kart_>	 abijeet_: on the change, will ping for testing.
[07:51:55] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266539
[07:51:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267614
[07:52:00] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:52:02] <abijeet_>	 kart_, ok
[07:52:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ganeti3005 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259)
[07:52:11] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267614
[07:52:14] <kart_>	 abijeet_: you can test now
[07:52:15] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 53066
[07:52:42] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 53066
[07:52:46] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262662
[07:53:04] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262662
[07:53:09] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268188
[07:53:28] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268188
[07:53:32] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 7063
[07:53:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7063
[07:53:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 273421
[07:53:56] <abijeet_>	 kart_, ok
[07:54:05] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 273421
[07:54:09] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 212635
[07:54:54] <abijeet_>	 kart_, tested, works
[07:55:16] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add ganeti3005 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[07:55:26] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 212635
[07:55:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28604
[07:55:44] <kart_>	 cool. syncing.
[07:55:55] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:55:56] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28604
[07:56:00] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263016
[07:56:11] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263016
[07:56:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263270
[07:56:40] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263270
[07:56:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 150178
[07:57:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 150178
[07:57:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add ganeti3005 to the esams03 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1184705 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[07:57:04] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 270364
[07:57:16] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270364
[07:57:20] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 40731
[07:57:29] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40731
[07:57:32] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 268197
[07:57:42] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268197
[07:57:46] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 267517
[07:57:57] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267517
[07:58:02] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 264927
[07:58:16] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264927
[07:58:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 7679
[07:59:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7679
[07:59:32] <abijeet_>	 kart_, we can deploy
[07:59:33] <XioNoX>	 alright I'm done :)
[07:59:46] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82525 and previous config saved to /var/cache/conftool/dbconfig/20250904-075945-ladsgroup.json
[07:59:50] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[08:00:05] <jouncebot>	 dancy and andre: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T0800).
[08:01:19] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184645|Revert^2 "TranslationUnitDTO: Make blob type properties writable"]] (duration: 15m 30s)
[08:01:27] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184706
[08:02:58] <abijeet_>	 kart_, all done?
[08:04:09] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:puppetserver::volatile avoid loading Spur data on certain host [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede)
[08:04:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] interface: create rt_tables.d as needed [puppet] - 10https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:05:02] <wikibugs>	 (03CR) 10Muehlenhoff: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:05:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:07:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:07:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:08:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[08:09:10] <kart_>	 abijeet_: sorry. yes. All done.
[08:10:57] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:14:27] <wikibugs>	 (03CR) 10Federico Ceratto: "See comment in https://phabricator.wikimedia.org/T403617#11143918" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[08:14:54] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P82526 and previous config saved to /var/cache/conftool/dbconfig/20250904-081453-ladsgroup.json
[08:16:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 150178
[08:17:22] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 150178
[08:18:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet
[08:20:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3005.esams.wmnet to cluster esams03 and group B
[08:21:09] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:21:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3005.esams.wmnet to cluster esams03 and group B
[08:22:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:28:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695 (10mahmoud.abdelsattar.wmde) 03NEW
[08:29:49] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "Can you run a pcc?" [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:30:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P82527 and previous config saved to /var/cache/conftool/dbconfig/20250904-083001-ladsgroup.json
[08:30:17] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11146759 (10BTullis) 05Open→03Resolved
[08:31:23] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org
[08:31:25] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:31:38] <wikibugs>	 (03CR) 10KCVelaga: [C:03+1] "@kartik.mistry@gmail.com can you help with deployment for this one?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: 10KCVelaga)
[08:32:58] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[08:33:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Add prometheus3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620)
[08:33:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:33:15] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11146776 (10Peter)
[08:35:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:35:41] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:35:41] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:35:41] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors
[08:35:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors
[08:36:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11146785 (10VRiley-WMF) |Device A|Device A Port|Device B|Device B Port|Type|Notes|Length required| |----------|-----------------|----------|----------|-------|-----|-------------...
[08:36:14] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:36:18] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:36:19] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas3001.wikimedia.org
[08:37:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11146787 (10VRiley-WMF) Was informed to update this with more information, but found out I already has updated this at 12:18PM
[08:37:30] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11146786 (10mahmoud.abdelsattar.wmde) Logstash access is denied when I try to access it using my SSO.
[08:38:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:39:54] <elukey>	 !log kill and restart imposm on maps-test2001 - stuck since August 10, lag building up and alerts
[08:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:00] <elukey>	 moritzm: --^
[08:42:43] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add prometheus3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: 10Muehlenhoff)
[08:43:28] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] Add prometheus3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: 10Muehlenhoff)
[08:44:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/1184511/6845/" [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:44:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: port ::instance to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi)
[08:45:09] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T402925)', diff saved to https://phabricator.wikimedia.org/P82528 and previous config saved to /var/cache/conftool/dbconfig/20250904-084508-ladsgroup.json
[08:45:13] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[08:45:14] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[08:47:44] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11146860 (10Peter)
[08:49:56] <wikibugs>	 (03CR) 10Tiziano Fogli: "Thanks, I'm going to fix it." [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (owner: 10Tiziano Fogli)
[08:50:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11146872 (10Clement_Goubert) >>! In T400661#11143881, @Jhancock.wm wrote: > @jasmine_ how do you feel about the server going in row D? doesn't look like we have one in that row....
[08:50:42] <wikibugs>	 (03PS2) 10Tiziano Fogli: nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446)
[08:51:16] <wikibugs>	 (03CR) 10Btullis: dse-k8s: Introduce opensearch-operator namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[08:52:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add prometheus3004 [puppet] - 10https://gerrit.wikimedia.org/r/1184707 (https://phabricator.wikimedia.org/T403620) (owner: 10Muehlenhoff)
[08:53:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[08:53:32] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11146879 (10ayounsi)
[08:54:21] <wikibugs>	 (03PS3) 10Tiziano Fogli: nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446)
[08:56:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host prometheus3004.esams.wmnet
[08:56:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:57:48] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[08:59:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3004.esams.wmnet - jmm@cumin2002"
[08:59:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3004.esams.wmnet - jmm@cumin2002"
[08:59:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache prometheus3004.esams.wmnet on all recursors
[09:00:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3004.esams.wmnet on all recursors
[09:00:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3004.esams.wmnet - jmm@cumin2002"
[09:00:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3004.esams.wmnet - jmm@cumin2002"
[09:03:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:01] <logmsgbot>	 jmm@cumin2002 makevm (PID 1965233) is awaiting input
[09:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:04:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11146922 (10karapayneWMDE) Hello! Wikidata EM at WMDE here. Can confirm that Mahmoud is our new Staff Engineer
[09:04:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus3004.esams.wmnet with OS bookworm
[09:07:07] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:07:39] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] esams: remove sandbox filter [homer/public] - 10https://gerrit.wikimedia.org/r/1184507 (owner: 10Ayounsi)
[09:09:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11146942 (10Vgutierrez) a:05Vgutierrez→03JMeybohm assigning the task to @JMeybohm, he is the SRE on clinic du...
[09:09:29] <wikibugs>	 (03PS1) 10Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - 10https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577)
[09:09:31] <wikibugs>	 (03Merged) 10jenkins-bot: esams: remove sandbox filter [homer/public] - 10https://gerrit.wikimedia.org/r/1184507 (owner: 10Ayounsi)
[09:10:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wmcs: remove HighIOWaitStalling [alerts] - 10https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502)
[09:11:09] <wikibugs>	 (03PS1) 10Btullis: Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184716
[09:11:41] <wikibugs>	 (03PS1) 10Btullis: Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/1184717
[09:12:00] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney)
[09:12:21] <wikibugs>	 (03CR) 10David Caro: [C:03+2] replica_cnf: disable ssl by default on replica.cnf files [puppet] - 10https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: 10David Caro)
[09:13:56] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:48] <wikibugs>	 (Abandoned) Muehlenhoff: profile::wmcs::instance: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971455 (owner: Muehlenhoff)
[09:15:13] <wikibugs>	 (CR) FNegri: [C:+1] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[09:15:52] <wikibugs>	 (Abandoned) Muehlenhoff: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff)
[09:15:55] <wikibugs>	 (CR) David Caro: [C:+1] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[09:15:58] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] wmcs: remove HighIOWaitStalling [alerts] - https://gerrit.wikimedia.org/r/1184715 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[09:17:22] <wikibugs>	 (Abandoned) Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: Muehlenhoff)
[09:18:51] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11146990 (JMeybohm) @KFrancis could you please verify/ensure the NDA has been signed?
[09:23:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3004.esams.wmnet with reason: host reimage
[09:24:16] <wikibugs>	 (CR) Stevemunene: [C:+1] Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - https://gerrit.wikimedia.org/r/1184717 (owner: Btullis)
[09:26:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance
[09:26:32] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance
[09:26:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82529 and previous config saved to /var/cache/conftool/dbconfig/20250904-092639-fceratto.json
[09:26:43] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[09:28:26] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Use the standby analytics_meta mariadb server temporarily" [puppet] - https://gerrit.wikimedia.org/r/1184717 (owner: Btullis)
[09:28:27] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 7679
[09:28:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82530 and previous config saved to /var/cache/conftool/dbconfig/20250904-092849-fceratto.json
[09:28:58] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 7679
[09:29:07] <icinga-wm>	 PROBLEM - MariaDB Replica IO: analytics_meta on db1208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event . at 4, the last event read from analytics-meta-bin.000323 at 536559012, the last byte read from analytics-meta-bin.000323 at 536559043. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%2
[09:29:07] <icinga-wm>	 ng_a_replica
[09:29:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3004.esams.wmnet with reason: host reimage
[09:29:37] <wikibugs>	 (PS1) Muehlenhoff: Remove the update definition for thirdparty/helm3 [puppet] - https://gerrit.wikimedia.org/r/1184719
[09:31:27] <wikibugs>	 (CR) Stevemunene: [C:+1] Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis)
[09:33:24] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis)
[09:33:56] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:35:06] <wikibugs>	 (Merged) jenkins-bot: Revert "Facilitate a role swap between an-mariadb1001 and an-mariadb1002" [deployment-charts] - https://gerrit.wikimedia.org/r/1184716 (owner: Btullis)
[09:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:36:50] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[09:37:08] <wikibugs>	 (PS1) Vgutierrez: haproxy: Send client TLS fingerprint to varnish [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270)
[09:37:17] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[09:37:25] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[09:37:45] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[09:37:59] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[09:38:07] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[09:38:17] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[09:40:33] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[09:41:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:41:06] <icinga-wm>	 RECOVERY - MariaDB Replica IO: analytics_meta on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:42:59] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[09:43:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82532 and previous config saved to /var/cache/conftool/dbconfig/20250904-094357-fceratto.json
[09:44:04] <wikibugs>	 (CR) Tiziano Fogli: "The /var/lib/prometheus directory is created with 0755 permissions by the prometheus-node-exporter Debian package, along with the promethe" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[09:46:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus3004.esams.wmnet with OS bookworm
[09:46:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3004.esams.wmnet
[09:48:31] <wikibugs>	 (CR) FNegri: [C:+1] "The helm repo has moved recently:" [puppet] - https://gerrit.wikimedia.org/r/1184719 (owner: Muehlenhoff)
[09:48:56] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:50:31] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724
[09:51:01] <wikibugs>	 (CR) David Caro: [C:+1] Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 (owner: Filippo Giunchedi)
[09:51:07] <wikibugs>	 (CR) FNegri: [C:+1] Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 (owner: Filippo Giunchedi)
[09:51:43] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] Revert "wmcs: port ::instance to firewall::service" [puppet] - https://gerrit.wikimedia.org/r/1184724 (owner: Filippo Giunchedi)
[09:54:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3007.wikimedia.org
[09:54:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:54:47] <wikibugs>	 (PS1) David Caro: helm: update the repo [puppet] - https://gerrit.wikimedia.org/r/1184726
[09:55:26] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11147104 (Peter)
[09:58:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002"
[09:59:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82533 and previous config saved to /var/cache/conftool/dbconfig/20250904-095904-fceratto.json
[09:59:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002"
[09:59:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:59:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3007.wikimedia.org on all recursors
[09:59:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast3007.wikimedia.org on all recursors
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1000)
[10:00:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002"
[10:00:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002"
[10:01:16] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!  As discussed we'll likely need or be able to improve on the purely static approach in time but for now this should work ok." [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:02:41] <moritzm>	 !log imported jenkins 2.516.2 for Bullseye/Bookworm T403703
[10:02:43] <wikibugs>	 (CR) FNegri: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:02:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:44] <stashbot>	 T403703: Upgrade Jenkins to 2.516.2 - https://phabricator.wikimedia.org/T403703
[10:03:21] <logmsgbot>	 jmm@cumin2002 makevm (PID 1995771) is awaiting input
[10:06:47] <wikibugs>	 (CR) FNegri: [C:+1] helm: update the repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:08:11] <wikibugs>	 (PS2) David Caro: helm: update the repo [puppet] - https://gerrit.wikimedia.org/r/1184726
[10:08:11] <wikibugs>	 (CR) David Caro: helm: update the repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:08:24] <logmsgbot>	 jmm@cumin2002 makevm (PID 1995771) is awaiting input
[10:09:14] <wikibugs>	 (CR) David Caro: [C:-1] "The key should be armored, not unarmored, updating" [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:10:24] <wikibugs>	 (PS3) David Caro: helm: update the repo [puppet] - https://gerrit.wikimedia.org/r/1184726
[10:14:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T401906)', diff saved to https://phabricator.wikimedia.org/P82534 and previous config saved to /var/cache/conftool/dbconfig/20250904-101412-fceratto.json
[10:14:16] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[10:14:28] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance
[10:14:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast3007.wikimedia.org with OS bookworm
[10:14:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82535 and previous config saved to /var/cache/conftool/dbconfig/20250904-101435-fceratto.json
[10:14:49] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147152 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm
[10:16:05] <wikibugs>	 (CR) FNegri: [C:+1] helm: update the repo [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:16:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82536 and previous config saved to /var/cache/conftool/dbconfig/20250904-101645-fceratto.json
[10:17:00] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Didn't check the key, but the config change looks good. After merging you can validate by forcing a Puppet run on apt1002 and then running" [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[10:17:35] <wikibugs>	 (Abandoned) Muehlenhoff: Remove the update definition for thirdparty/helm3 [puppet] - https://gerrit.wikimedia.org/r/1184719 (owner: Muehlenhoff)
[10:17:58] <wikibugs>	 (CR) Ladsgroup: "If this is going to be on all wikis eventually, just put dblist: all instead. You don't need to update this every time you deploy to new w" [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: Dreamy Jazz)
[10:21:16] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ncredir3003 [puppet] - https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[10:24:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3003.esams.wmnet
[10:26:51] <wikibugs>	 (CR) Elukey: [C:+1] "I like it, left a comment to add more info but the rest looks good!" [puppet] - https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[10:28:28] <wikibugs>	 (PS1) Ayounsi: Revert "Remove esams RIPE Atlas measurements" [puppet] - https://gerrit.wikimedia.org/r/1184731
[10:28:45] <wikibugs>	 (PS1) Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - https://gerrit.wikimedia.org/r/1184732
[10:29:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:29:37] <wikibugs>	 (PS2) Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - https://gerrit.wikimedia.org/r/1184732
[10:29:56] <wikibugs>	 (PS3) Ayounsi: Revert "Remove atlas3001 from monitoring" [puppet] - https://gerrit.wikimedia.org/r/1184732
[10:31:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82537 and previous config saved to /var/cache/conftool/dbconfig/20250904-103153-fceratto.json
[10:34:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:36:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:36:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:36:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir3003.esams.wmnet
[10:36:24] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147233 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3003.esams.wmnet` - ncredir3003.esams.wmnet (**PASS**)   - Downtimed host o...
[10:38:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3007.wikimedia.org with reason: host reimage
[10:38:54] <wikibugs>	 (PS1) Muehlenhoff: Drop esams01 cluster and reimage ganeti3007 [puppet] - https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259)
[10:38:56] <wikibugs>	 (PS1) Muehlenhoff: Remove esams01 from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259)
[10:42:41] <wikibugs>	 (CR) Ladsgroup: "Generally speaking, it looks good to me. I think there have been some concerns over using seconds behind master since it's not as accurate" [alerts] - https://gerrit.wikimedia.org/r/1184039 (https://phabricator.wikimedia.org/T315866) (owner: Tiziano Fogli)
[10:43:02] <wikibugs>	 (PS2) Muehlenhoff: Remove esams01 from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259)
[10:43:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3007.wikimedia.org with reason: host reimage
[10:46:00] <wikibugs>	 (CR) Jcrespo: "So I am not saying what's the solution. There is probably other options, of which I suggested one. What worries me is to make certain thin" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:46:04] <wikibugs>	 (CR) Dreamy Jazz: "We don't intend to deploy to all wikis eventually AFAIK." [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: Dreamy Jazz)
[10:47:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82538 and previous config saved to /var/cache/conftool/dbconfig/20250904-104700-fceratto.json
[10:48:27] <wikibugs>	 (CR) Jcrespo: "> The /var/lib/prometheus directory is created with 0755 permissions by the prometheus-node-exporter Debian package" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:55:51] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Nokia: /routing-policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:58:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast3007.wikimedia.org with OS bookworm
[10:58:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3007.wikimedia.org
[10:58:18] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147298 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm completed: - bast300...
[10:58:33] <wikibugs>	 (CR) Ayounsi: [C:+2] Revert "Remove atlas3001 from monitoring" [puppet] - https://gerrit.wikimedia.org/r/1184732 (owner: Ayounsi)
[10:58:39] <wikibugs>	 (CR) Ayounsi: [C:+2] Revert "Remove esams RIPE Atlas measurements" [puppet] - https://gerrit.wikimedia.org/r/1184731 (owner: Ayounsi)
[10:58:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:00:07] <wikibugs>	 (CR) Ayounsi: [C:+1] Drop esams01 cluster and reimage ganeti3007 [puppet] - https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:00:21] <wikibugs>	 (CR) Ayounsi: [C:+1] Remove esams01 from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:02:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T401906)', diff saved to https://phabricator.wikimedia.org/P82539 and previous config saved to /var/cache/conftool/dbconfig/20250904-110207-fceratto.json
[11:02:11] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:02:23] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance
[11:02:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82540 and previous config saved to /var/cache/conftool/dbconfig/20250904-110230-fceratto.json
[11:02:37] <wikibugs>	 (CR) Ayounsi: [C:+2] Nokia: /routing-policy [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[11:03:56] <wikibugs>	 (Merged) jenkins-bot: Nokia: /routing-policy [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[11:04:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82541 and previous config saved to /var/cache/conftool/dbconfig/20250904-110440-fceratto.json
[11:06:52] <wikibugs>	 (PS1) Muehlenhoff: Readd bast3007 as bastion node [puppet] - https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259)
[11:08:30] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11147326 (LSobanski) Just a heads up that the alert fired again, can it be silenced for another month?
[11:08:56] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:09:06] <wikibugs>	 (CR) Ladsgroup: [C:+1] "Tests are always appreciated" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182228 (owner: Krinkle)
[11:09:41] <wikibugs>	 (PS2) Muehlenhoff: Readd bast3007 as bastion node [puppet] - https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259)
[11:15:09] <wikibugs>	 (CR) Ayounsi: "A few comments but overall lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[11:19:09] <wikibugs>	 (PS2) Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577)
[11:19:48] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9a6431c] (releasing): Update backup releases Jenkins
[11:19:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82542 and previous config saved to /var/cache/conftool/dbconfig/20250904-111947-fceratto.json
[11:21:23] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9a6431c] (releasing): Update backup releases Jenkins (duration: 02m 09s)
[11:22:50] <wikibugs>	 (PS3) Muehlenhoff: Remove esams01 from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259)
[11:27:05] <wikibugs>	 (PS2) Jcrespo: dbbackups: Initial setup of dbprov1007, dbprov2007 [puppet] - https://gerrit.wikimedia.org/r/1182865 (https://phabricator.wikimedia.org/T403166)
[11:27:57] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[11:28:05] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82543 and previous config saved to /var/cache/conftool/dbconfig/20250904-112804-ladsgroup.json
[11:28:08] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[11:29:21] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove esams01 from Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1184735 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:31:05] <wikibugs>	 (CR) David Caro: [C:+2] helm: update the repo [puppet] - https://gerrit.wikimedia.org/r/1184726 (owner: David Caro)
[11:32:01] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing
[11:32:05] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Initial setup of dbprov1007, dbprov2007 [puppet] - https://gerrit.wikimedia.org/r/1182865 (https://phabricator.wikimedia.org/T403166) (owner: Jcrespo)
[11:32:29] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing (duration: 00m 38s)
[11:34:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82544 and previous config saved to /var/cache/conftool/dbconfig/20250904-113455-fceratto.json
[11:36:19] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing
[11:36:45] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Testing (duration: 00m 26s)
[11:39:13] <wikibugs>	 (CR) D3r1ck01: tests: Add test for wmfApplyEtcdDBConfig() (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182228 (owner: Krinkle)
[11:43:26] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update production releases Jenkins
[11:44:01] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update production releases Jenkins (duration: 00m 36s)
[11:44:07] <wikibugs>	 (PS3) Cathal Mooney: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577)
[11:45:34] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:46:01] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Drop esams01 cluster and reimage ganeti3007 [puppet] - https://gerrit.wikimedia.org/r/1184734 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:48:02] <wikibugs>	 (CR) Ayounsi: [C:+1] Readd bast3007 as bastion node [puppet] - https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:50:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T401906)', diff saved to https://phabricator.wikimedia.org/P82545 and previous config saved to /var/cache/conftool/dbconfig/20250904-115002-fceratto.json
[11:50:06] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:50:18] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance
[11:50:20] <wikibugs>	 (CR) Ayounsi: [C:+1] Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[11:50:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82546 and previous config saved to /var/cache/conftool/dbconfig/20250904-115025-fceratto.json
[11:51:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82547 and previous config saved to /var/cache/conftool/dbconfig/20250904-115135-fceratto.json
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1200)
[12:04:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3007.esams.wmnet with OS bookworm
[12:06:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P82548 and previous config saved to /var/cache/conftool/dbconfig/20250904-120646-fceratto.json
[12:06:54] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Nokia: Save some vars to data{} dict so it only needs to be done once (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[12:08:38] <wikibugs>	 (Merged) jenkins-bot: Nokia: Save some vars to data{} dict so it only needs to be done once [homer/public] - https://gerrit.wikimedia.org/r/1184714 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[12:14:01] <arnoldokoth>	 !log Upgrade envoyproxy on vrts1003 T402584
[12:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:04] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[12:15:25] <wikibugs>	 (PS1) David Caro: helm: use the id of the subkey, not the parent key [puppet] - https://gerrit.wikimedia.org/r/1184754
[12:19:32] <wikibugs>	 (CR) David Caro: [C:+2] helm: use the id of the subkey, not the parent key [puppet] - https://gerrit.wikimedia.org/r/1184754 (owner: David Caro)
[12:21:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P82549 and previous config saved to /var/cache/conftool/dbconfig/20250904-122153-fceratto.json
[12:22:50] <wikibugs>	 (PS1) Cathal Mooney: Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577)
[12:23:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:24:09] <wikibugs>	 (CR) CI reject: [V:-1] Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[12:25:09] <wikibugs>	 (PS1) Arnaudb: gerrit: add alloy from grafana repo [puppet] - https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611)
[12:26:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Readd bast3007 as bastion node [puppet] - https://gerrit.wikimedia.org/r/1184742 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[12:26:10] <wikibugs>	 (PS2) Cathal Mooney: Nokia: module to configure BGP in network-instance and add IBGP peers [homer/public] - https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577)
[12:26:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage
[12:27:55] <wikibugs>	 (PS1) Cory Massaro: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - https://gerrit.wikimedia.org/r/1184766
[12:28:36] <wikibugs>	 (PS2) Cory Massaro: WIP: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - https://gerrit.wikimedia.org/r/1184766
[12:28:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:33:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3007.esams.wmnet with reason: host reimage
[12:35:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:37:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T401906)', diff saved to https://phabricator.wikimedia.org/P82550 and previous config saved to /var/cache/conftool/dbconfig/20250904-123701-fceratto.json
[12:37:06] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[12:37:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance
[12:37:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82551 and previous config saved to /var/cache/conftool/dbconfig/20250904-123723-fceratto.json
[12:38:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82552 and previous config saved to /var/cache/conftool/dbconfig/20250904-123833-fceratto.json
[12:40:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:43:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:47:20] <wikibugs>	 (PS1) Jgreen: Add frmx2002.wikimedia.org A/PTR records. [dns] - https://gerrit.wikimedia.org/r/1184771 (https://phabricator.wikimedia.org/T403673)
[12:48:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:53:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82553 and previous config saved to /var/cache/conftool/dbconfig/20250904-125341-fceratto.json
[12:54:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3007.esams.wmnet with OS bookworm
[12:58:23] <wikibugs>	 (PS1) Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - https://gerrit.wikimedia.org/r/1184772
[13:00:05] <jouncebot>	 Urbanecm and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:03:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:03:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:00] <XioNoX>	 !log push pfw policies - T403717
[13:04:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:08:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82554 and previous config saved to /var/cache/conftool/dbconfig/20250904-130848-fceratto.json
[13:10:46] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti3007 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259)
[13:12:15] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184776 (https://phabricator.wikimedia.org/T395130)
[13:13:34] <hashar>	 !log upgrading CI Jenkins | T403703
[13:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:37] <stashbot>	 T403703: Upgrade Jenkins to 2.516.2 - https://phabricator.wikimedia.org/T403703
[13:13:56] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:49] <wikibugs>	 (CR) Krinkle: "I've removed the overrides via Horizon, logged at:" [puppet] - https://gerrit.wikimedia.org/r/1183275 (owner: Krinkle)
[13:18:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:20:17] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1006 to an-worker1235
[13:20:26] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[13:21:00] <wikibugs>	 (Abandoned) Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184776 (https://phabricator.wikimedia.org/T395130) (owner: Tiziano Fogli)
[13:21:34] <wikibugs>	 (CR) Ayounsi: [C:+1] Add ganeti3007 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[13:23:05] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: assign prometheus::pop role, setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184778 (https://phabricator.wikimedia.org/T403620)
[13:23:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:23:25] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11147809 (Papaul)
[13:23:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82555 and previous config saved to /var/cache/conftool/dbconfig/20250904-132356-fceratto.json
[13:24:02] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[13:24:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance
[13:24:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82556 and previous config saved to /var/cache/conftool/dbconfig/20250904-132419-fceratto.json
[13:24:28] <wikibugs>	 (Abandoned) Tiziano Fogli: prometheus3004: assign prometheus::pop role, setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184778 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:24:29] <logmsgbot>	 !log dcaro@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudcephosd1052.eqiad.wmnet with reason: swapping network card
[13:25:06] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw: document SCS ports in Netbox - https://phabricator.wikimedia.org/T403634#11147841 (Papaul) p:Triage→Medium
[13:25:21] <wikibugs>	 (PS1) Federico Ceratto: updates: correct Suite names for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305)
[13:25:22] <wikibugs>	 (CR) Federico Ceratto: "As discussed on IRC" [puppet] - https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: Federico Ceratto)
[13:25:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:26:10] <logmsgbot>	 btullis@cumin1003 rename (PID 999795) is awaiting input
[13:26:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82557 and previous config saved to /var/cache/conftool/dbconfig/20250904-132630-fceratto.json
[13:28:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:28:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ganeti3007 to the esams03 cluster [puppet] - https://gerrit.wikimedia.org/r/1184774 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[13:29:34] <wikibugs>	 (CR) Jgreen: [C:+2] Add frmx2002.wikimedia.org A/PTR records. [dns] - https://gerrit.wikimedia.org/r/1184771 (https://phabricator.wikimedia.org/T403673) (owner: Jgreen)
[13:29:43] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184783 (https://phabricator.wikimedia.org/T403620)
[13:29:44] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: assign prometheus::pop role [puppet] - https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620)
[13:29:46] <wikibugs>	 (PS1) Tiziano Fogli: prometheus::pop: enable rsyncd on esams [puppet] - https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620)
[13:29:53] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[13:30:04] <logmsgbot>	 !log mforns@deploy1003 Started deploy [analytics/refinery@a1f5011]: Fix for pageview actor automated reasons [analytics/refinery@a1f5011b]
[13:30:53] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[13:31:02] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11147880 (Jhancock.wm) Open→Resolved
[13:31:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dnsdse-k8s-worker1014  - jclark@cumin1002"
[13:31:44] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dnsdse-k8s-worker1014  - jclark@cumin1002"
[13:31:44] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:32:03] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11147887 (Jhancock.wm) Open→Resolved
[13:32:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:32:57] <logmsgbot>	 !log mforns@deploy1003 Finished deploy [analytics/refinery@a1f5011]: Fix for pageview actor automated reasons [analytics/refinery@a1f5011b] (duration: 02m 52s)
[13:33:28] <logmsgbot>	 !log mforns@deploy1003 Started deploy [analytics/refinery@a1f5011] (thin): Fix for pageview actor automated reasons THIN [analytics/refinery@a1f5011b]
[13:33:57] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:34:25] <logmsgbot>	 !log mforns@deploy1003 Finished deploy [analytics/refinery@a1f5011] (thin): Fix for pageview actor automated reasons THIN [analytics/refinery@a1f5011b] (duration: 00m 57s)
[13:34:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[13:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:35:21] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: assign replica_label [puppet] - https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620)
[13:35:38] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6846/co" [puppet] - https://gerrit.wikimedia.org/r/1184772 (owner: Slyngshede)
[13:35:48] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:35:59] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[13:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:37:24] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147915 (MoritzMuehlenhoff)
[13:38:30] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147919 (MoritzMuehlenhoff)
[13:38:43] <wikibugs>	 (PS2) Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - https://gerrit.wikimedia.org/r/1184772
[13:38:45] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:38:45] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1235 on all recursors
[13:38:49] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1235 on all recursors
[13:38:49] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1235
[13:39:25] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147920 (ayounsi)
[13:41:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82558 and previous config saved to /var/cache/conftool/dbconfig/20250904-134137-fceratto.json
[13:41:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:41:51] <logmsgbot>	 btullis@cumin1003 rename (PID 999795) is awaiting input
[13:42:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:42:50] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184783 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:42:52] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1235
[13:42:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11147952 (Jclark-ctr) a:Jclark-ctr
[13:43:00] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6847/console" [puppet] - https://gerrit.wikimedia.org/r/1184772 (owner: Slyngshede)
[13:43:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11147954 (Jclark-ctr)
[13:43:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1006 to an-worker1235
[13:43:38] <wikibugs>	 (PS1) Tiziano Fogli: Revert "prometheus3004: setup firewall rules" [puppet] - https://gerrit.wikimedia.org/r/1184787
[13:45:18] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] Revert "prometheus3004: setup firewall rules" [puppet] - https://gerrit.wikimedia.org/r/1184787 (owner: Tiziano Fogli)
[13:45:40] <wikibugs>	 (CR) Herron: "Shall we set this to 'b' temporarily, since prom300[34] may be running at the same time?" [puppet] - https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:46:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:46:48] <wikibugs>	 (PS2) Tiziano Fogli: prometheus3004: assign prometheus::pop role [puppet] - https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620)
[13:46:48] <wikibugs>	 (PS2) Tiziano Fogli: prometheus::pop: enable rsyncd on esams [puppet] - https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620)
[13:46:48] <wikibugs>	 (PS2) Tiziano Fogli: prometheus3004: assign replica_label [puppet] - https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620)
[13:46:48] <wikibugs>	 (PS1) Tiziano Fogli: prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184789 (https://phabricator.wikimedia.org/T403620)
[13:47:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:47:30] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:47:36] <wikibugs>	 (CR) MVernon: [C:+1] updates: correct Suite names for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: Federico Ceratto)
[13:48:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:48:54] <wikibugs>	 (CR) Tiziano Fogli: "No, they won’t run concurrently. One will replace the other, with a small gap in between." [puppet] - https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:48:56] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:49:38] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11147975 (MoritzMuehlenhoff)
[13:50:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove VIP for esams01 - jmm@cumin2002"
[13:50:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove VIP for esams01 - jmm@cumin2002"
[13:50:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:50:36] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:50:51] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:51:33] <wikibugs>	 (PS1) Filippo Giunchedi: firewall: add LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899)
[13:51:34] <wikibugs>	 (PS1) Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899)
[13:51:36] <wikibugs>	 (PS1) Filippo Giunchedi: bird: use LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899)
[13:52:29] <wikibugs>	 (CR) Filippo Giunchedi: "Enables I16048a91 and clean up in I83fe8bede" [puppet] - https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[13:52:38] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1007 to an-worker1236
[13:52:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[13:54:00] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus3004: setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1184789 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:54:15] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus3004: assign prometheus::pop role [puppet] - https://gerrit.wikimedia.org/r/1184784 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:54:28] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus::pop: enable rsyncd on esams [puppet] - https://gerrit.wikimedia.org/r/1184785 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:55:07] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus3004: assign replica_label [puppet] - https://gerrit.wikimedia.org/r/1184786 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[13:56:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1007 to an-worker1236 - btullis@cumin1003"
[13:56:45] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1007 to an-worker1236 - btullis@cumin1003"
[13:56:45] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:56:45] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1236 on all recursors
[13:56:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82560 and previous config saved to /var/cache/conftool/dbconfig/20250904-135645-fceratto.json
[13:56:48] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1236 on all recursors
[13:56:49] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1236
[13:57:08] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1236
[13:57:47] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1007 to an-worker1236
[13:58:12] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:24] <wikibugs>	 (CR) Federico Ceratto: [C:+2] updates: correct Suite names for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/1184781 (https://phabricator.wikimedia.org/T397305) (owner: Federico Ceratto)
[13:58:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:59:12] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:12] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[14:00:23] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:01:02] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[14:01:54] <wikibugs>	 (PS1) Jgreen: nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674)
[14:01:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 270735
[14:02:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:02:26] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 270735
[14:02:35] <wikibugs>	 (CR) CI reject: [V:-1] nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: Jgreen)
[14:03:22] <wikibugs>	 (CR) CDanis: haproxy: Send client TLS fingerprint to varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[14:03:31] <wikibugs>	 (PS3) Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317)
[14:03:31] <wikibugs>	 (CR) Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[14:03:31] <wikibugs>	 (PS1) Jforrester: Stop loading the Graph extension anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317)
[14:03:56] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:04:43] <logmsgbot>	 jclark@cumin1002 provision (PID 1450785) is awaiting input
[14:06:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:07:30] <wikibugs>	 SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729 (elukey) NEW
[14:08:05] <wikibugs>	 SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729#11148078 (elukey)
[14:08:39] <wikibugs>	 (CR) Herron: [C:+1] "Nice! 🙌" [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[14:10:31] <wikibugs>	 (PS2) Vgutierrez: haproxy: Send client TLS fingerprint to varnish [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270)
[14:10:48] <wikibugs>	 (PS1) Scott French: shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284)
[14:10:50] <wikibugs>	 (CR) Vgutierrez: haproxy: Send client TLS fingerprint to varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[14:11:09] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:11:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:11:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:11:47] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[14:11:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T401906)', diff saved to https://phabricator.wikimedia.org/P82561 and previous config saved to /var/cache/conftool/dbconfig/20250904-141152-fceratto.json
[14:11:56] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:12:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:12:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82562 and previous config saved to /var/cache/conftool/dbconfig/20250904-141215-fceratto.json
[14:13:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet
[14:14:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82564 and previous config saved to /var/cache/conftool/dbconfig/20250904-141426-fceratto.json
[14:14:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:14:47] <wikibugs>	 (PS1) Tiziano Fogli: prometheus/esams: remove 3003, add 3004 [puppet] - https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620)
[14:16:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:16:35] <wikibugs>	 (CR) Kosta Harlan: dse-k8s-eqiad: Add ipoid-opensearch namespaces (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: Bking)
[14:16:47] <wikibugs>	 (CR) Tiziano Fogli: "To be submitted after the final rsync has completed and the services on 3003 have been stopped" [puppet] - https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[14:17:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:18:18] <wikibugs>	 (CR) Herron: [C:+1] prometheus/esams: remove 3003, add 3004 [puppet] - https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[14:20:17] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[14:21:03] <wikibugs>	 (PS9) Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926)
[14:21:12] <wikibugs>	 (CR) Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: Bking)
[14:21:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:22:34] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[14:23:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet
[14:24:14] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[14:25:50] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:25:54] <wikibugs>	 (CR) Ayounsi: [C:+1] "very nice!" [homer/public] - https://gerrit.wikimedia.org/r/1184759 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[14:25:54] <moritzm>	 !log upgrade Envoyproxy on webperf* T402584
[14:25:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:58] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[14:26:00] <wikibugs>	 (CR) Bking: [C:+2] dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: Bking)
[14:27:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82565 and previous config saved to /var/cache/conftool/dbconfig/20250904-142701-ladsgroup.json
[14:27:07] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[14:27:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:28:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams03 and group B
[14:29:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82566 and previous config saved to /var/cache/conftool/dbconfig/20250904-142933-fceratto.json
[14:30:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1430)
[14:31:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3007.esams.wmnet to cluster esams03 and group B
[14:32:00] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11148207 (MoritzMuehlenhoff)
[14:32:09] <wikibugs>	 (CR) Vgutierrez: [C:+2] beta: Update hieradata for fe_vcl_config from Horizon [puppet] - https://gerrit.wikimedia.org/r/1183275 (owner: Krinkle)
[14:34:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir3005.esams.wmnet to drbd
[14:34:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:34:27] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11148226 (ops-monitoring-bot) VM ncredir3005.esams.wmnet switching disk type to drbd
[14:38:07] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11148246 (MoritzMuehlenhoff)
[14:39:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:39:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#11148265 (MoritzMuehlenhoff)
[14:41:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:42:09] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82567 and previous config saved to /var/cache/conftool/dbconfig/20250904-144208-ladsgroup.json
[14:42:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:44:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir3005.esams.wmnet to drbd
[14:44:13] <icinga-wm>	 PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:29] <icinga-wm>	 RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.66 ms
[14:44:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82569 and previous config saved to /var/cache/conftool/dbconfig/20250904-144441-fceratto.json
[14:46:06] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mr1-ulsfo with reason: Bgp testing
[14:46:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11148288 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=829d4d0b-c9d0-4961-b07b-d12e8f1ac430) set by pt1979@cumin2002 for 2:00:00 on 1 host(s) and their...
[14:47:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:50:07] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus/esams: remove 3003, add 3004 [puppet] - https://gerrit.wikimedia.org/r/1184802 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[14:50:21] <wikibugs>	 (Abandoned) Bking: elastic: add test hieradata to help with LVS migration [puppet] - https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: Bking)
[14:51:13] <XioNoX>	 !log disable OSPF on mr1-ulsfo to test BGP
[14:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:21] <moritzm>	 ^ ncredir3005 was depooled (part of the routed ganeti update)
[14:51:34] <moritzm>	 !log upgrade Envoyproxy on Puppet servers T402584
[14:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:41] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[14:52:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:54:49] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:54:51] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:55:04] <wikibugs>	 (Abandoned) Bking: opensearch-k8s: allow setting vm.max_map_count [puppet] - https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: Ryan Kemper)
[14:55:29] <icinga-wm>	 PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[14:55:38] <wikibugs>	 (CR) CDanis: [C:+1] haproxy: Send client TLS fingerprint to varnish [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[14:55:41] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] firewall: add LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[14:55:51] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:56:19] <wikibugs>	 (PS1) Tiziano Fogli: prometheus/esams: update svc record [dns] - https://gerrit.wikimedia.org/r/1184814 (https://phabricator.wikimedia.org/T403620)
[14:56:27] <icinga-wm>	 PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:47] <icinga-wm>	 RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.64 ms
[14:56:47] <icinga-wm>	 RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.30 ms
[14:57:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:57:17] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82570 and previous config saved to /var/cache/conftool/dbconfig/20250904-145716-ladsgroup.json
[14:57:30] <wikibugs>	 (CR) Vgutierrez: [C:+2] haproxy: Send client TLS fingerprint to varnish [puppet] - https://gerrit.wikimedia.org/r/1184720 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[14:57:52] <wikibugs>	 SRE, envoy, serviceops, Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738 (bking) NEW
[14:58:52] <godog>	 sigh I missed a ; in ferm, there will be failures
[14:58:57] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:58:58] <godog>	 sending a fixup now
[14:59:35] <logmsgbot>	 jhancock@cumin1002 provision (PID 1576767) is awaiting input
[14:59:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:59:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T401906)', diff saved to https://phabricator.wikimedia.org/P82571 and previous config saved to /var/cache/conftool/dbconfig/20250904-145948-fceratto.json
[14:59:52] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:59:53] <wikibugs>	 (PS1) Filippo Giunchedi: ferm: fixup LINK_LOCAL definition [puppet] - https://gerrit.wikimedia.org/r/1184815
[15:00:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance
[15:00:05] <jouncebot>	 dancy and andre: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1500).
[15:00:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T401906)', diff saved to https://phabricator.wikimedia.org/P82572 and previous config saved to /var/cache/conftool/dbconfig/20250904-150011-fceratto.json
[15:00:15] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:00:41] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1184815 (owner: Filippo Giunchedi)
[15:00:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:00:48] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] ferm: fixup LINK_LOCAL definition [puppet] - https://gerrit.wikimedia.org/r/1184815 (owner: Filippo Giunchedi)
[15:00:56] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] ferm: fixup LINK_LOCAL definition [puppet] - https://gerrit.wikimedia.org/r/1184815 (owner: Filippo Giunchedi)
[15:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:02:22] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:02:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T401906)', diff saved to https://phabricator.wikimedia.org/P82573 and previous config saved to /var/cache/conftool/dbconfig/20250904-150221-fceratto.json
[15:02:31] <wikibugs>	 (PS1) Ayounsi: Add ulsfo private v4 range to prefix-list pops4 [homer/public] - https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845)
[15:03:57] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-main-availability esams - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:04:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:05:07] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus/esams: update svc record [dns] - https://gerrit.wikimedia.org/r/1184814 (https://phabricator.wikimedia.org/T403620) (owner: Tiziano Fogli)
[15:05:21] <logmsgbot>	 !log tappof@dns1004 START - running authdns-update
[15:06:22] <logmsgbot>	 !log tappof@dns1004 END - running authdns-update
[15:08:57] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:57] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:09:45] <wikibugs>	 (CR) Filippo Giunchedi: "I am no longer in o11y, adding o11y folks instead for deployment" [puppet] - https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: Jgreen)
[15:11:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11148443 (Jhancock.wm) @elukey it still fails with just the BIOS update. moving on to idrac and ssd updates.
[15:11:12] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye
[15:12:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:12:24] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T402925)', diff saved to https://phabricator.wikimedia.org/P82574 and previous config saved to /var/cache/conftool/dbconfig/20250904-151223-ladsgroup.json
[15:12:28] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[15:12:28] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[15:12:35] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82575 and previous config saved to /var/cache/conftool/dbconfig/20250904-151235-ladsgroup.json
[15:13:38] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye
[15:13:57] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:57] <jinxer-wm>	 RESOLVED: [2x] SLOMetricAbsent: wdqs-main-availability esams - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:16:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:17:15] <wikibugs>	 SRE, Observability-Metrics, Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266#11148493 (Peachey88) Stalled→Resolved p:Unbreak!→Medium a:ssingh
[15:17:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P82576 and previous config saved to /var/cache/conftool/dbconfig/20250904-151729-fceratto.json
[15:17:34] <moritzm>	 !log installing apache2 security updates
[15:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:47] <wikibugs>	 SRE, Observability-Metrics: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954#11148499 (Aklapper) Stalled→Open p:Unbreak!→Medium
[15:18:02] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:20:34] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Good spot yep this is what is needed." [homer/public] - https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845) (owner: Ayounsi)
[15:20:46] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:21:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:22:02] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:22:21] <moritzm>	 !log upgrade Envoyproxy on cloudweb servers T402584
[15:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:24] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[15:22:27] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:23:19] <wikibugs>	 (PS8) Scott French: P:mediawiki::php: add 8.3 and simplify versioning [puppet] - https://gerrit.wikimedia.org/r/1184101
[15:24:06] <wikibugs>	 (CR) Scott French: "Great! Thanks again for the review and testing, Timo." [puppet] - https://gerrit.wikimedia.org/r/1184101 (owner: Scott French)
[15:25:00] <tappof>	 !log migration from prometheus3003.esams to prometheus3004 has been completed T403620
[15:25:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:03] <stashbot>	 T403620: Migrate prometheus3003 to prometheus3004 - https://phabricator.wikimedia.org/T403620
[15:26:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:27:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:27:34] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:31:48] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:32:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P82577 and previous config saved to /var/cache/conftool/dbconfig/20250904-153236-fceratto.json
[15:33:56] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:45] <jinxer-wm>	 RESOLVED: [5x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:37:14] <logmsgbot>	 jhancock@cumin1002 provision (PID 1638589) is awaiting input
[15:39:34] <logmsgbot>	 btullis@cumin1003 reimage (PID 1012017) is awaiting input
[15:44:24] <logmsgbot>	 btullis@cumin1003 reimage (PID 1012182) is awaiting input
[15:45:49] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:47:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:47:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T401906)', diff saved to https://phabricator.wikimedia.org/P82578 and previous config saved to /var/cache/conftool/dbconfig/20250904-154744-fceratto.json
[15:47:49] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[15:48:00] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance
[15:48:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance
[15:48:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T401906)', diff saved to https://phabricator.wikimedia.org/P82579 and previous config saved to /var/cache/conftool/dbconfig/20250904-154824-fceratto.json
[15:49:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T401906)', diff saved to https://phabricator.wikimedia.org/P82580 and previous config saved to /var/cache/conftool/dbconfig/20250904-154934-fceratto.json
[15:50:29] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:55:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[15:57:51] <wikibugs>	 (PS6) Federico Ceratto: mysqld_exporter.pp: reset /var/log/prometheus perms [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859)
[16:00:05] <jouncebot>	 jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:05:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:08:57] <wikibugs>	 (03PS1) 10Ahmon Dancy: buildkitd.toml.erb: Temporarily enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924)
[16:11:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:12:03] <wikibugs>	 (03PS2) 10Jgreen: nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674)
[16:12:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:14:17] <wikibugs>	 (03CR) 10Ahmon Dancy: "Cherry-picked to gitlab-runners-puppetserver-01.gitlab-runners.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924) (owner: 10Ahmon Dancy)
[16:16:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:16:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] P:mediawiki::php: add 8.3 and simplify versioning [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French)
[16:17:00] <swfrench-wmf>	 FYI, the intermittent MediaWikiMemcachedHighErrorRate alerts seem to be due to T401425. I'll follow up there.
[16:17:01] <stashbot>	 T401425: Investigate memcache errors during wikidata and commons dumps runs - https://phabricator.wikimedia.org/T401425
[16:18:53] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11148790 (10Mvolz) >>! In T345627#11138642, @elukey wrote: > @Mvolz Hi! Sorry for the dela...
[16:19:31] <swfrench-wmf>	 jouncebot: nowandnext
[16:19:31] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1600)
[16:19:31] <jouncebot>	 In 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700)
[16:19:31] <jouncebot>	 In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700)
[16:21:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:22:13] <swfrench-wmf>	 FYI, in a few minutes, I'll start piloting a fraction of traffic for one Shellbox service (syntaxhighlight) on PHP 8.3 (T403284). this will be reverted after a couple of hours of testing.
[16:22:13] <stashbot>	 T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284
[16:22:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:24:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:24:32] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Looks sane to me, but the amount of corner cases that we're tracking is becoming quite worrisome for the long term maintainability of the " [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[16:25:49] <wikibugs>	 (03PS1) 10Hashar: releases-jenkins: fix httpbb monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/1184870 (https://phabricator.wikimedia.org/T403703)
[16:27:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:28:31] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[16:28:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] releases-jenkins: fix httpbb monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/1184870 (https://phabricator.wikimedia.org/T403703) (owner: 10Hashar)
[16:28:35] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[16:29:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:30:17] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: pilot 1 replica on 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184177 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[16:31:07] <mutante>	 re: releases1003 "down" - it is not actually down, but the HTML content changed due to a version upgrade, and monitoring checks for a string (one that it was hoping could never change, like "log in")
[16:31:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:31:35] <rzl>	 I'm going to roll out some intended-to-be-no-op envoy config changes to {api,rest}-gateway in staging and then eventually prod -- swfrench-wmf I'm happy to wait and go after you if you prefer, but I don't think there should actually be any conflict
[16:31:46] <mutante>	 needless to say.. it changed.. upstream managed to do that and changed "log in" to "Sign in" ...
[16:32:41] <swfrench-wmf>	 rzl: these should be disjoint enough in their effect that I'd say go ahead :)
[16:32:48] <rzl>	 👍
[16:33:06] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:33:16] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:33:40] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:33:45] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:33:49] <rzl>	 wowee look at us, operating this distributed system concurrently
[16:35:04] <btullis>	 !log upgrading and restarting envoyproxy on cephosd1001 for T402584
[16:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:07] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[16:35:10] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance
[16:35:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82581 and previous config saved to /var/cache/conftool/dbconfig/20250904-163517-fceratto.json
[16:35:21] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[16:36:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:36:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11148901 (10RobH) >>! In T402536#11146541, @VRiley-WMF wrote: > I have noticed some of these cables are active and currently connected.  Can you list them out specifically for double checking?  Any of them th...
[16:37:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82582 and previous config saved to /var/cache/conftool/dbconfig/20250904-163727-fceratto.json
[16:37:58] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:38:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:39:19] <swfrench-wmf>	 !log started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in codfw - T403284
[16:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:22] <stashbot>	 T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284
[16:39:50] <btullis>	 !log upgrading and restarting envoyproxy on cephosd100[2-5] for T402584
[16:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:01] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:42:28] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:44:23] <rzl>	 !log deployed chart 0.11.11 to api-gateway and rest-gateway staging, T403101
[16:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:26] <stashbot>	 T403101: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101
[16:44:49] <btullis>	 !log upgrading and restarting envoyproxy on cephosd200[1-3] for T402584
[16:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:52] <stashbot>	 T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[16:46:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:50:26] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876
[16:50:34] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:52:06] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425)
[16:52:09] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:52:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82583 and previous config saved to /var/cache/conftool/dbconfig/20250904-165235-fceratto.json
[16:52:47] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:52:58] <swfrench-wmf>	 !log started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in eqiad - T403284
[16:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:02] <stashbot>	 T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284
[16:54:42] <wikibugs>	 (03PS2) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157
[16:55:00] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425)
[16:55:34] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[16:55:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[16:55:59] <wikibugs>	 (03CR) 10BryanDavis: hcaptcha: Redirect / to mw.o project page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis)
[16:56:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:59:29] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[17:00:05] <jouncebot>	 bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1700)
[17:00:15] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[17:00:50] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876 (owner: 10BryanDavis)
[17:02:11] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[17:02:29] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[17:02:35] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-09-04-122329-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184876 (owner: 10BryanDavis)
[17:03:25] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:03:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:02] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:04:08] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[17:04:15] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425)
[17:04:20] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[17:04:46] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[17:04:56] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:05:26] <wikibugs>	 (03PS4) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425)
[17:07:08] <rzl>	 !log deployed chart 0.11.11 to api-gateway and rest-gateway prod, T403101
[17:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:11] <stashbot>	 T403101: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101
[17:07:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P82584 and previous config saved to /var/cache/conftool/dbconfig/20250904-170743-fceratto.json
[17:08:08] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:08:11] <wikibugs>	 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11149158 (10RLazarus)
[17:08:49] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:08:56] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:09:15] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:15:02] <wikibugs>	 (03CR) 10A smart kitten: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[17:16:23] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11149191 (10Krinkle) I'm guessing the below has the same root cause, albeit on a deployment host, not a varnish host.  ` krinkle@deployment-deploy04:~$ sudo tail -n1...
[17:17:04] <wikibugs>	 (03CR) 10Dzahn: "@denisse: has Grafana Alloy been discussed before in observability? I am kind of surprised it seems at the same time common but not used b" [puppet] - 10https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb)
[17:20:00] <wikibugs>	 (03CR) 10Vgutierrez: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[17:20:17] <wikibugs>	 (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on esams" [puppet] - 10https://gerrit.wikimedia.org/r/1184883
[17:21:12] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on esams" [puppet] - 10https://gerrit.wikimedia.org/r/1184883 (owner: 10Tiziano Fogli)
[17:22:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T401906)', diff saved to https://phabricator.wikimedia.org/P82585 and previous config saved to /var/cache/conftool/dbconfig/20250904-172250-fceratto.json
[17:22:54] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus3003: remove firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184885 (https://phabricator.wikimedia.org/T403620)
[17:22:55] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[17:23:34] <wikibugs>	 (03CR) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[17:24:00] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425)
[17:27:14] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1235.eqiad.wmnet with OS bullseye
[17:27:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:28:04] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus3003: remove firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1184885 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli)
[17:28:06] <wikibugs>	 (03PS4) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944
[17:28:32] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French)
[17:28:35] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1236.eqiad.wmnet with OS bullseye
[17:30:01] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:mediawiki::php: add 8.3 and simplify versioning [puppet] - 10https://gerrit.wikimedia.org/r/1184101 (owner: 10Scott French)
[17:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:33:56] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:34:57] <wikibugs>	 (03PS1) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[17:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:35:33] <wikibugs>	 (03PS2) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[17:35:35] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[17:36:48] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Add an Allow header on 405 responses [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767)
[17:37:06] <wikibugs>	 (03PS3) 10Aaron Schulz: Add restbase spec JSON files to which /rest_v1/?spec can be routed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203)
[17:37:10] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[17:37:24] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) (owner: 10Vgutierrez)
[17:38:45] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[17:39:18] <wikibugs>	 (03PS5) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[17:42:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:42:51] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus/esams: remove 3003 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184888 (https://phabricator.wikimedia.org/T403620)
[17:43:21] <wikibugs>	 (03PS3) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[17:43:21] <wikibugs>	 (03PS6) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[17:43:29] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[17:44:02] <wikibugs>	 (03PS3) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157
[17:44:03] <wikibugs>	 (03PS2) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158
[17:45:38] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus/esams: remove 3003 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184888 (https://phabricator.wikimedia.org/T403620) (owner: 10Tiziano Fogli)
[17:46:26] <ryankemper>	 !log [WDQS] T403738 Rolling restart of `envoyproxy.service` on `wdqs-main`, 2 hosts at a time
[17:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:30] <stashbot>	 T403738: Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738
[17:46:40] <wikibugs>	 (03CR) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[17:46:40] <wikibugs>	 (03CR) 10Reedy: [C:03+1] Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester)
[17:47:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:48:57] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:52:27] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672)
[17:52:39] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus3003.esams.wmnet
[17:52:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[17:54:11] <wikibugs>	 (03PS2) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672)
[17:57:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:57:24] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.dns.netbox
[17:58:14] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[17:59:49] <wikibugs>	 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11149376 (10RKemper) Current status:  [x] WCQS [x] WDQS main [x] WDQS scholarly [] WDQS public
[18:00:04] <jouncebot>	 dancy and andre: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800).
[18:00:13] <dancy>	 o/
[18:00:41] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378)
[18:00:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot)
[18:00:56] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002"
[18:01:26] <wikibugs>	 (03CR) 10Scott French: "Thank you very much, Effie! One issue and one question." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli)
[18:01:35] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002"
[18:01:35] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:01:36] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus3003.esams.wmnet
[18:01:43] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184892 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot)
[18:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:03:12] <wikibugs>	 (03CR) 10Dzahn: "lgtm. should fix issue on beta deployment server. was not an issue in prod because .git already existed. which sounds like a manual init w" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:03:40] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "unfortunately: https://puppet-compiler.wmflabs.org/output/1184891/6850/deploy1003.eqiad.wmnet/change.deploy1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:03:57] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:04:28] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "so... if already declared at (file: /srv/jenkins/puppet-compiler/6850/change/src/modules/scap/manifests/master.pp, line: 58) then why is i" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:05:49] <wikibugs>	 (03PS4) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[18:06:29] <wikibugs>	 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11149401 (10RKemper) a:03RKemper
[18:07:00] <wikibugs>	 (03CR) 10BryanDavis: P:puppetserver::volatile avoid loading Spur data on certain host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede)
[18:07:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:07:57] <wikibugs>	 (03CR) 10Krinkle: "PS3 applied in beta cluster (no-op):" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[18:08:11] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[18:08:19] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "we could use "ensure_resource('file', '/srv/patches/.git', {'ensure' => 'directory' })  which should not error out for duplicate declarati" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:08:26] <wikibugs>	 (03PS5) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[18:08:31] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[18:08:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82586 and previous config saved to /var/cache/conftool/dbconfig/20250904-180855-ladsgroup.json
[18:09:01] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[18:09:10] <wikibugs>	 (03PS1) 10RLazarus: mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101)
[18:09:43] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.17  refs T396378
[18:09:46] <stashbot>	 T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378
[18:09:50] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "Is profile::kubernetes::deployment_server::mediawiki::release not applied on a deployment server in beta?" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:11:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149438 (10VRiley-WMF) Oh! I see what you mean. I'll take care of them. I was misreading it. Sorry about that!
[18:12:07] <wikibugs>	 (03PS3) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672)
[18:12:08] <wikibugs>	 (03CR) 10Krinkle: "https://puppet-compiler.wmflabs.org/output/1184886/4889/cp1102.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[18:12:21] <wikibugs>	 (03CR) 10Dzahn: "ah.. so .. the resource just needs to have a different title to not conflict with the same command for a different .git dir. use "command"" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:12:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149442 (10VRiley-WMF)
[18:12:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:12:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11149443 (10VRiley-WMF) 05Open→03Resolved
[18:13:39] <wikibugs>	 (03PS4) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672)
[18:13:56] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:14:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:14:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:14:39] <wikibugs>	 (03PS5) 10Ahmon Dancy: scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672)
[18:16:29] <wikibugs>	 (03PS1) 10Scott French: mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425)
[18:16:33] <wikibugs>	 (03PS6) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[18:16:33] <wikibugs>	 (03PS7) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[18:17:16] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510)
[18:17:31] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:19:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:23:10] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French)
[18:23:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:24:02] <wikibugs>	 (03CR) 10Ahmon Dancy: "Revised." [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:24:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82587 and previous config saved to /var/cache/conftool/dbconfig/20250904-182403-ladsgroup.json
[18:24:27] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French)
[18:25:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to <wmde and nda>for <mahmoud-abdelsattar> - https://phabricator.wikimedia.org/T403695#11149491 (10KFrancis) Hi All, there is not an NDA on file for Mahmoud Abdelsattar.  @mahmoud.abdelsattar.wmde Ple...
[18:25:53] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[18:25:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm executed with errors: - dse-k8s-worker10...
[18:26:00] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: ignore kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1184894 (https://phabricator.wikimedia.org/T401425) (owner: 10Scott French)
[18:26:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, verified from RFC the Allow: format." [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) (owner: 10Vgutierrez)
[18:28:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[18:28:47] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11149517 (10Jhancock.wm) @elukey   okay so what i did today in terms of firmware updates is: cp2044 BIOS, iDRAC, SSD  cp2046 BIOS, iDRAC only  cp2047 BIOS only...
[18:28:54] <wikibugs>	 (03CR) 10Scott French: mediawiki-dumps-legacy: Use in-pod mcrouter container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli)
[18:31:43] <wikibugs>	 (03PS3) 10Krinkle: tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228
[18:31:47] <wikibugs>	 (03CR) 10Krinkle: tests: Add test for wmfApplyEtcdDBConfig() (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle)
[18:34:51] <wikibugs>	 (03PS1) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938)
[18:35:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11149539 (10VRiley-WMF) es1057 - rack C3, U14
[18:35:46] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 40731
[18:36:10] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40731
[18:36:51] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 212635
[18:38:25] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 212635
[18:38:36] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "lgtm! https://puppet-compiler.wmflabs.org/output/1184891/6851/deploy1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:38:40] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] scap::master: Initialize git repo at $patches_path if needed [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:39:12] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82588 and previous config saved to /var/cache/conftool/dbconfig/20250904-183911-ladsgroup.json
[18:41:33] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks Dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:48:08] <wikibugs>	 (03PS1) 10Dreamy Jazz: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471)
[18:48:16] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "@Krinkle: this should have fixed the issue on beta deploy server. I confirmed it was noop on both prod servers." [puppet] - 10https://gerrit.wikimedia.org/r/1184891 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy)
[18:49:03] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "Thank you!!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184706 (owner: 10PipelineBot)
[18:49:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz)
[18:49:57] <wikibugs>	 (03PS4) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471)
[18:50:09] <wikibugs>	 (03PS3) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595)
[18:53:37] <wikibugs>	 (03PS2) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938)
[18:53:40] <wikibugs>	 (03PS8) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[18:53:40] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510)
[18:54:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T402925)', diff saved to https://phabricator.wikimedia.org/P82590 and previous config saved to /var/cache/conftool/dbconfig/20250904-185418-ladsgroup.json
[18:54:23] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[18:54:24] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[18:54:29] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[18:54:32] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82591 and previous config saved to /var/cache/conftool/dbconfig/20250904-185431-ladsgroup.json
[18:54:34] <wikibugs>	 (03PS2) 10Dreamy Jazz: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471)
[18:54:46] <Dreamy_Jazz>	 jouncebot: nowandnext
[18:54:46] <jouncebot>	 For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800)
[18:54:46] <jouncebot>	 In 1 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000)
[18:55:30] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] ""Please ensure you declare profile::base::overlayfs: true in hiera."" [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[18:56:08] <wikibugs>	 (03PS3) 10Dzahn: zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938)
[18:56:20] <mutante>	 jouncebot: help
[18:56:20] <jouncebot>	 **** JounceBot Help ****
[18:56:20] <jouncebot>	 JounceBot is a deployment helper bot for the Wikimedia movement.
[18:56:20] <jouncebot>	 Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot
[18:56:20] <jouncebot>	 Available commands:
[18:56:20] <jouncebot>	  HELP    Print all commands known to the server.
[18:56:20] <jouncebot>	  NEXT    Get the next deployment event(s if they happen at the same time).
[18:56:21] <jouncebot>	  NOW     Get the current deployment event(s) or the time until the next.
[18:56:21] <jouncebot>	  NOWANDNEXT Get the current and next deployment event(s).
[18:56:22] <jouncebot>	  REFRESH Refresh my knowledge about deployments.
[18:56:45] <Dreamy_Jazz>	 Anyone mind if I backport?
[18:56:52] <dancy>	 OK w/ me
[18:56:56] <Dreamy_Jazz>	 Thanks
[18:57:00] <mutante>	 no concerns
[18:57:20] <Dreamy_Jazz>	 My config change will be a no-op so shouldn't see any changes in any logs etc
[18:57:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz)
[18:57:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:58:12] <wikibugs>	 (03Merged) 10jenkins-bot: Create checkuser-suggested-investigations.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184903 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz)
[18:58:32] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]]
[18:58:33] <wikibugs>	 (03CR) 10Dreamy Jazz: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz)
[18:58:35] <stashbot>	 T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471
[18:58:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:59:12] <wikibugs>	 (03CR) 10D3r1ck01: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182228 (owner: 10Krinkle)
[19:01:15] <wikibugs>	 (03CR) 10Ssingh: varnish: factor out unified_mobile_domain_regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[19:02:04] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester)
[19:03:42] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:03:46] <stashbot>	 T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471
[19:05:44] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[19:08:56] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:10:27] <wikibugs>	 (03CR) 10Jforrester: "(Needs to wait for 1.45.0-wmf.18 to be everywhere.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester)
[19:11:08] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184903|Create checkuser-suggested-investigations.dblist (T403471)]] (duration: 12m 36s)
[19:11:12] <stashbot>	 T403471: Document suggested investigations tables in tables-catalog.yaml - https://phabricator.wikimedia.org/T403471
[19:13:21] <Dreamy_Jazz>	 I'm done with my deploys
[19:13:57] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:18:27] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "this also gets us overlayfs!  -> https://puppet-compiler.wmflabs.org/output/1184900/6854/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[19:19:01] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul::main: use profile docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1184900 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[19:20:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149725 (10Jclark-ctr)
[19:21:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:33:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:34:27] <wikibugs>	 (03CR) 10Krinkle: varnish: factor out unified_mobile_domain_regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[19:38:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11149794 (10Jclark-ctr) @bking @BTullis  Can you assist with preseed.yaml? It doesn’t appear to be configured for EFI booting on this server. EFI is required since it was ordered with NVMe Drives
[19:41:34] <Dreamy_Jazz>	 jouncebot: nowandnext
[19:41:34] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T1800)
[19:41:34] <jouncebot>	 In 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000)
[19:41:49] <Dreamy_Jazz>	 Anyone mind if I deploy a security patch?
[19:42:23] <perryprog>	 is it the one that fixes the Evil Exploit that I'm actively abusing where, uh, uh, I uhh....
[19:42:24] <perryprog>	 idk
[19:42:26] <perryprog>	 something funny
[19:44:41] <Dreamy_Jazz>	 Yes. The very evil exploit
[19:44:48] <perryprog>	 ughhh. Okay fine, I'll let you patch it.
[19:44:53] <Dreamy_Jazz>	 :D
[19:53:03] <wikibugs>	 (03PS3) 10Cory Massaro: Increase max recursion depth in the orchestrator's composition language. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766
[19:53:31] <wikibugs>	 (03PS7) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[19:53:39] <logmsgbot>	 !log dreamyjazz Deployed security patch for T403757
[19:53:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[19:54:46] <wikibugs>	 (03PS8) 10Krinkle: varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595)
[19:55:57] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[19:57:24] <Dreamy_Jazz>	 I'm done with my security deploy in case anyone wants to use the late backport window
[19:58:43] <wikibugs>	 (03PS1) 10RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584)
[19:59:32] <wikibugs>	 (03PS3) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534)
[20:00:06] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2000).
[20:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:04:46] <wikibugs>	 (03CR) 10RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus)
[20:04:48] <wikibugs>	 (03CR) 10A smart kitten: hcaptcha: Respond with HTTP 405 to disallowed methods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[20:05:24] <wikibugs>	 (03CR) 10Krinkle: "PCC for prod:" [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[20:05:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:06:27] <wikibugs>	 (03PS9) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[20:07:51] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus)
[20:10:31] <wikibugs>	 (03CR) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking)
[20:13:19] <wikibugs>	 (03PS2) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534)
[20:14:31] <wikibugs>	 (03CR) 10Andrea Denisse: "Hi Daniel, I can't recall any discussions regarding it at the time but it's a tool worth exploring so I'll add it to my team's agenda." [puppet] - 10https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb)
[20:14:56] <wikibugs>	 (03CR) 10Bking: dse-k8s-eqiad: Add opensearch-ipoid namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking)
[20:15:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:16:50] <wikibugs>	 (PS4) Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595)
[20:16:50] <wikibugs>	 (PS2) Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510)
[20:16:51] <wikibugs>	 (PS5) Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510)
[20:16:58] <wikibugs>	 (CR) CI reject: [V:-1] Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[20:17:01] <wikibugs>	 (CR) CI reject: [V:-1] Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:17:11] <James_F>	 Krinkle: I'm very excited for that; good luck!
[20:18:12] <Krinkle>	 thx
[20:18:13] <wikibugs>	 (PS5) Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595)
[20:18:13] <wikibugs>	 (PS3) Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510)
[20:31:53] <wikibugs>	 (PS2) RLazarus: mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584)
[20:44:53] <wikibugs>	 (PS1) Dzahn: zuul::executor: add parameter for port and set it to 7100 [puppet] - https://gerrit.wikimedia.org/r/1184924 (https://phabricator.wikimedia.org/T395938)
[20:49:44] <wikibugs>	 (CR) Bking: [C:+2] opensearch-operator: Add chart for review (2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[20:51:18] <wikibugs>	 (PS7) Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[20:52:38] <wikibugs>	 (CR) CI reject: [V:-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[20:53:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:57:00] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11150061 (Krinkle)
[20:58:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250904T2100)
[21:00:44] <wikibugs>	 (PS1) Bking: WIP: Introduce opensearch-operator to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246)
[21:03:48] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[21:03:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:11:27] <sbassett>	 Hey all - would like to get one quick sec patch out right now if I can.  Let me know if I should hold off…
[21:16:59] <logmsgbot>	 jhathaway@cumin1002 provision (PID 2164935) is awaiting input
[21:17:29] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[21:18:07] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150196 (bd808)
[21:18:57] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150202 (bd808)
[21:19:05] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054#11150204 (bd808)
[21:23:45] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:25:45] <wikibugs>	 (PS1) BryanDavis: beta: Remove replica instance from wmgMainStashServers [mediawiki-config] - https://gerrit.wikimedia.org/r/1184937 (https://phabricator.wikimedia.org/T401227)
[21:27:54] <sbassett>	 !log Deployed security fix for T403411 to 1.45.0-wmf.17
[21:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:03] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply new opensearch plugins pkg - bking@cumin1002 - T403749
[21:31:07] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[21:33:57] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:35:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:37:04] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply new opensearch plugins pkg - bking@cumin1002 - T403749
[21:37:07] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[21:38:20] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new opensearch plugins pkg - bking@cumin1002 - T403749
[21:48:56] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:51:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:52:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:57:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:59:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:00:18] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82593 and previous config saved to /var/cache/conftool/dbconfig/20250904-220017-ladsgroup.json
[22:00:22] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[22:04:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:06:09] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new opensearch plugins pkg - bking@cumin1002 - T403749
[22:06:12] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[22:11:20] <wikibugs>	 (Abandoned) Ahmon Dancy: buildkitd.toml.erb: Temporarily enable debug [puppet] - https://gerrit.wikimedia.org/r/1184830 (https://phabricator.wikimedia.org/T396924) (owner: Ahmon Dancy)
[22:15:26] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82594 and previous config saved to /var/cache/conftool/dbconfig/20250904-221525-ladsgroup.json
[22:21:48] <logmsgbot>	 !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[22:21:51] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[22:30:33] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P82595 and previous config saved to /var/cache/conftool/dbconfig/20250904-223032-ladsgroup.json
[22:39:58] <wikibugs>	 (PS1) Scott French: P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184942
[22:40:00] <wikibugs>	 (PS3) Scott French: P:logstash::common: update filters for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184943
[22:41:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:41:47] <wikibugs>	 (CR) Scott French: "Thanks in advance for the review, Cole!" [puppet] - https://gerrit.wikimedia.org/r/1184943 (owner: Scott French)
[22:45:07] <wikibugs>	 (CR) Papaul: [C:+2] Add ulsfo private v4 range to prefix-list pops4 [homer/public] - https://gerrit.wikimedia.org/r/1184816 (https://phabricator.wikimedia.org/T294845) (owner: Ayounsi)
[22:45:36] <logmsgbot>	 !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[22:45:40] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[22:45:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T402925)', diff saved to https://phabricator.wikimedia.org/P82596 and previous config saved to /var/cache/conftool/dbconfig/20250904-224540-ladsgroup.json
[22:45:44] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[22:45:57] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[22:46:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2179 (T402925)', diff saved to https://phabricator.wikimedia.org/P82597 and previous config saved to /var/cache/conftool/dbconfig/20250904-224604-ladsgroup.json
[22:46:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:47:55] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:48:55] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:51:11] <wikibugs>	 (PS5) Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471)
[22:51:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:51:46] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Document new CheckUser database tables [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: Dreamy Jazz)
[22:53:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11150505 (Papaul) mr1-ulsfo is now running BGP. All OSPF entries on mr1-ulsfo, cr3-ulsfo and cr4-ulsfo for the management network removed.
[22:58:57] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:02:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:08:57] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:11:17] <wikibugs>	 (CR) Cwhite: [C:+2] P:logstash::common: update filters for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184943 (owner: Scott French)
[23:11:40] <wikibugs>	 (PS4) Scott French: P:logstash::common: update filters for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184943
[23:12:08] <logmsgbot>	 !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[23:12:11] <stashbot>	 T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[23:12:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:13:57] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:15] <wikibugs>	 (CR) Cwhite: [C:+2] P:logstash::common: update filters for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184943 (owner: Scott French)
[23:27:53] <wikibugs>	 (PS2) Scott French: P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184942
[23:32:46] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[23:35:32] <wikibugs>	 (PS1) Scott French: shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284)
[23:37:46] <jinxer-wm>	 FIRING: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[23:38:05] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184951
[23:38:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184951 (owner: TrainBranchBot)
[23:38:15] <wikibugs>	 (CR) Scott French: [C:+2] shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[23:39:59] <wikibugs>	 (Merged) jenkins-bot: shellbox-syntaxhighlight: revert single-replica 8.3 pilot [deployment-charts] - https://gerrit.wikimedia.org/r/1184950 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[23:40:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[23:41:01] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[23:41:24] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[23:41:36] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[23:41:48] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[23:41:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[23:42:31] <swfrench-wmf>	 !log finished single-replica PHP 8.3 pilot on shellbox-syntaxhighlight - T403284
[23:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:34] <stashbot>	 T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284
[23:50:46] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184951 (owner: TrainBranchBot)
[23:51:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:52:46] <jinxer-wm>	 FIRING: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[23:57:46] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota