[00:04:40] (PS1) C. Scott Ananian: Replace ParamType with ListType [extensions/ReadingLists] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184185
[00:05:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ReadingLists] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184185 (owner: C. Scott Ananian)
[00:06:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82436 and previous config saved to /var/cache/conftool/dbconfig/20250903-000629-ladsgroup.json
[00:06:33] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:06:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[00:06:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82437 and previous config saved to /var/cache/conftool/dbconfig/20250903-000651-ladsgroup.json
[00:08:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184187
[00:08:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184187 (owner: TrainBranchBot)
[00:13:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:28:56] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad -
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:28:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:31:57] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184187 (owner: TrainBranchBot)
[01:00:43] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:04:05] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11142125 (phaultfinder)
[01:09:06] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11142129 (phaultfinder)
[01:12:28] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 45s)
[01:18:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:25:26] (CR) Samuel (WMF): [C:+1] hCaptcha: Set log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1183979 (owner: Kosta Harlan)
[01:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 -
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:03:56] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:03:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:19:30] ops-eqiad, DC-Ops: Unresponsive management for an-worker1233.mgmt:22 - https://phabricator.wikimedia.org/T403569 (phaultfinder) NEW
[02:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:40:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82439 and previous config saved to /var/cache/conftool/dbconfig/20250903-024011-ladsgroup.json
[02:40:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:43:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement
95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:55:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P82440 and previous config saved to /var/cache/conftool/dbconfig/20250903-025518-ladsgroup.json
[03:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:10:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P82441 and previous config saved to /var/cache/conftool/dbconfig/20250903-031026-ladsgroup.json
[03:25:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82442 and previous config saved to
/var/cache/conftool/dbconfig/20250903-032534-ladsgroup.json
[03:25:37] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:25:49] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[03:25:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T402925)', diff saved to https://phabricator.wikimedia.org/P82443 and previous config saved to /var/cache/conftool/dbconfig/20250903-032556-ladsgroup.json
[03:26:11] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11142215 (DavidBrooks) Another question about response 429, sorry. Is there a range of values I can expect for Retry-After? AWB already retries 30 second...
[04:27:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:28:56] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:32:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:48:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:55:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) (owner: Sbisson)
[04:56:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[05:06:45] (PS1) Papaul: Remove OSFP on mr1-ulsfo, cr3 and cr4 ulsfo [homer/public] - https://gerrit.wikimedia.org/r/1184202 (https://phabricator.wikimedia.org/T294845)
[05:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:06] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11142256 (phaultfinder)
[05:13:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:14:06] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11142258 (phaultfinder)
[05:18:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down -
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:33:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:39:52] (PS1) Papaul: Add back replace ospf to mr.conf [homer/public] - https://gerrit.wikimedia.org/r/1184204 (https://phabricator.wikimedia.org/T294845)
[05:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:53:56] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:53:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2
(Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:53:56] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:56:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T0600)
[06:01:25] SRE-swift-storage, MediaWiki-File-management, Performance Issue: Revision deletion on image files is excessively slow - https://phabricator.wikimedia.org/T403572#11142267 (Bugreporter)
[06:02:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T402925)', diff saved to https://phabricator.wikimedia.org/P82444 and previous config saved to /var/cache/conftool/dbconfig/20250903-060244-ladsgroup.json
[06:02:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:17:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P82445 and previous config saved to /var/cache/conftool/dbconfig/20250903-061752-ladsgroup.json
[06:18:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:28:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:30:35] (Abandoned) Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1183979 (owner: Kosta Harlan)
[06:33:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P82446 and previous config saved to /var/cache/conftool/dbconfig/20250903-063259-ladsgroup.json
[06:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:34:15] (PS1) Samwilson: CommonSettings: Add CommunityRequests projects and group [mediawiki-config] - https://gerrit.wikimedia.org/r/1184375 (https://phabricator.wikimedia.org/T393860)
[06:38:56] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:43:56] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:43:57] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:45:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:48:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T402925)', diff saved to https://phabricator.wikimedia.org/P82447 and previous config saved to /var/cache/conftool/dbconfig/20250903-064807-ladsgroup.json
[06:48:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:48:24] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[06:48:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T402925)', diff saved to https://phabricator.wikimedia.org/P82448 and previous config saved to /var/cache/conftool/dbconfig/20250903-064830-ladsgroup.json
[06:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs -
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:55:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:57:30] (CR) Slyngshede: [V:+1] "Not entirely sure that this actually works, but trying it seems like the easiest way to find out." [puppet] - https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[06:59:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T0700).
[07:00:05] Msz2001 and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] o/
[07:02:10] here
[07:02:34] Msz2001: go ahead with your patch and let me know once done
[07:02:51] I'm not a deployer, I need someone to deploy it for me
[07:03:40] OK. Give me a minute, will deploy.
[07:03:45] Thanks!
[07:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:23] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184089 (owner: Mszwarc)
[07:08:25] (CR) A smart kitten: hcaptcha: Redirect / to mw.o project page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184157 (owner: BryanDavis)
[07:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:09:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:11:25] (PS3) Arnaudb: gitlab: alert on sidekiq queue piling up [alerts] - https://gerrit.wikimedia.org/r/1184378
[07:11:26] (CR) Arnaudb: "yesterday's pattern: https://grafana.wikimedia.org/goto/lZ5l3vrHR?orgId=1" [alerts] - https://gerrit.wikimedia.org/r/1184378 (owner: Arnaudb)
[07:13:24] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:13:24] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:13:57] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down -
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:14:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:14:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr3-knams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:17:32] (Merged) jenkins-bot: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184089 (owner: Mszwarc)
[07:18:14] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184089|Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis"]]
[07:18:52] (CR) Tiziano Fogli: [C:+1] airflow: remove nrpe definitions [puppet] - https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite)
[07:19:05] (CR) Tiziano Fogli: [C:+1] airflow: disable icinga nrpe checks [puppet] - https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite)
[07:19:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and 208.80.153.216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:19:19] (CR) Tiziano Fogli: [C:+1] hiera: disable monitoring for legacy profile::airflow::instances [puppet] - https://gerrit.wikimedia.org/r/1184170
(https://phabricator.wikimedia.org/T384214) (owner: Cwhite)
[07:21:51] (CR) Ayounsi: [C:+1] Add back replace ospf to mr.conf [homer/public] - https://gerrit.wikimedia.org/r/1184204 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[07:22:22] (CR) Ayounsi: [C:+1] Remove OSFP on mr1-ulsfo, cr3 and cr4 ulsfo [homer/public] - https://gerrit.wikimedia.org/r/1184202 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[07:22:48] (PS1) Elukey: Release upstream version 1.31.0.8 [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1184388 (https://phabricator.wikimedia.org/T398600)
[07:23:13] !log kartik@deploy1003 kartik, mszwarc: Backport for [[gerrit:1184089|Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:24:13] kart_: verified, you can continue
[07:24:40] cool. Thanks for testing.
[07:24:45] !log kartik@deploy1003 kartik, mszwarc: Continuing with sync
[07:25:54] (PS1) Tiziano Fogli: base: remove check_microcode [puppet] - https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694)
[07:26:00] (CR) KartikMistry: [C:+2] CxServerClient: Log url instead of relative path upon failure [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) (owner: Sbisson)
[07:28:40] elukey@cumin1003 provision (PID 793280) is awaiting input
[07:30:08] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184089|Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis"]] (duration: 11m 54s)
[07:30:52] Thanks for deploying
[07:31:14] (CR) Elukey: "Built on build2002, copied the deb on ml-serve1012 and installed."
[debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1184388 (https://phabricator.wikimedia.org/T398600) (owner: Elukey)
[07:35:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:35:53] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:36:29] (PS1) Tiziano Fogli: check_gdnsd_checkconf: enable nrpe wrapper [puppet] - https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425)
[07:36:54] (Merged) jenkins-bot: CxServerClient: Log url instead of relative path upon failure [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) (owner: Sbisson)
[07:37:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:37:30] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184122|CxServerClient: Log url instead of relative path upon failure (T386131)]]
[07:37:34] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[07:38:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:39:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:44:29] !log kartik@deploy1003 kartik, sbisson: Backport for [[gerrit:1184122|CxServerClient: Log url instead of
relative path upon failure (T386131)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:44:33] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[07:46:36] (PS2) Elukey: Release upstream version 1.31.0.8 [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1184388 (https://phabricator.wikimedia.org/T398600)
[07:49:06] !log kartik@deploy1003 kartik, sbisson: Continuing with sync
[07:51:23] I'm skipping my second patch and moving to the next window.
[07:51:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[07:54:17] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184122|CxServerClient: Log url instead of relative path upon failure (T386131)]] (duration: 16m 46s)
[07:54:20] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[07:54:46] Nevermind, doing it :D
[07:54:50] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[07:55:48] (Merged) jenkins-bot: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[07:56:14] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184112|ContentTranslation: Add cxserver host for server-side requests (T386131)]]
[07:58:51] elukey@cumin1003 provision (PID 796703) is awaiting input
[07:59:21] !log elukey@cumin1003 END (FAIL)
- Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:59:22] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11142474 (JMeybohm)
[08:00:05] dancy and andre: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T0800).
[08:01:28] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045']
[08:02:00] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2045']
[08:02:15] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045']
[08:02:26] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2045']
[08:03:11] !log kartik@deploy1003 ngkountas, kartik: Backport for [[gerrit:1184112|ContentTranslation: Add cxserver host for server-side requests (T386131)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:03:14] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[08:04:12] (CR) Ayounsi: [C:+1] "nice!"
[software/homer/deploy] - https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:04:46] (CR) Ayounsi: [C:+1] JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:11:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:12:50] (PS1) Ayounsi: esams: rename sandbox range to virtual [puppet] - https://gerrit.wikimedia.org/r/1184471 (https://phabricator.wikimedia.org/T403580)
[08:13:15] (PS1) JMeybohm: Add shell and analytics-privatedata-users access for novemlinguae [puppet] - https://gerrit.wikimedia.org/r/1184472 (https://phabricator.wikimedia.org/T403336)
[08:13:36] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11142515 (JMeybohm)
[08:14:20] andre: I'm still with my config patch, some more time to test with it..
[08:14:49] kart_, deployment is still 10h away so no worries :)
[08:15:02] no UTC morning shift this week
[08:18:21] (CR) Fabfur: [C:+1] P:cache:haproxy replace semicolons in ISP names [puppet] - https://gerrit.wikimedia.org/r/1180814 (owner: Slyngshede)
[08:22:38] (CR) Elukey: profile::pyrra::filesystem::slo: refactor the class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176503 (owner: Elukey)
[08:23:23] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:26:26] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:27:07] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:31:13] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update esams sandbox IPs to routed ganeti - ayounsi@cumin1003"
[08:31:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update esams sandbox IPs to routed ganeti - ayounsi@cumin1003"
[08:31:49] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:32:28] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:32:37] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1233 on all recursors
[08:32:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1233 on all recursors
[08:32:41] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1233
[08:33:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1233
[08:33:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1
- https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:34:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1004 to an-worker1233
[08:34:54] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1005 to an-worker1234
[08:35:16] !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[08:36:27] (PS2) Filippo Giunchedi: openstack: add wmcs-server-id [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407)
[08:36:40] !log kartik@deploy1003 ngkountas, kartik: Continuing with sync
[08:37:00] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org
[08:37:00] (CR) CI reject: [V:-1] openstack: add wmcs-server-id [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: Filippo Giunchedi)
[08:37:35] (CR) Filippo Giunchedi: openstack: add wmcs-server-id (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: Filippo Giunchedi)
[08:37:59] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:38:39] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1005 to an-worker1234 - btullis@cumin1003"
[08:38:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming dumpsdata1005 to an-worker1234 - btullis@cumin1003"
[08:38:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:38:57] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-worker1234 on all recursors
[08:39:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache
(exit_code=0) an-worker1234 on all recursors
[08:39:01] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1234
[08:39:13] (PS3) Filippo Giunchedi: openstack: add wmcs-server-id [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407)
[08:39:22] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:40:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1234
[08:40:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from dumpsdata1005 to an-worker1234
[08:41:15] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[08:42:00] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184112|ContentTranslation: Add cxserver host for server-side requests (T386131)]] (duration: 45m 46s)
[08:42:03] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[08:42:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1234.eqiad.wmnet with OS bullseye
[08:43:08] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:43:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:43:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:43:13] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors
[08:43:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors
[08:43:31] !log
ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:44:43] (PS1) Joely Rooke WMDE: Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674)
[08:45:00] (PS1) KartikMistry: CX section positioning: Fix cxserver requests to include /v2 in the URL [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184481 (https://phabricator.wikimedia.org/T386131)
[08:45:25] (CR) Btullis: [C:+1] stat hosts: alert on I/O stalls [alerts] - https://gerrit.wikimedia.org/r/1184128 (https://phabricator.wikimedia.org/T401589) (owner: Bking)
[08:47:53] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:47:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM atlas3001.wikimedia.org - ayounsi@cumin1003"
[08:47:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:47:58] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors
[08:48:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors
[08:48:06] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host atlas3001.wikimedia.org
[08:48:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:54:10] (CR) Cathal Mooney: [C:+1] Remove OSFP on
mr1-ulsfo, cr3 and cr4 ulsfo [homer/public] - https://gerrit.wikimedia.org/r/1184202 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[08:54:25] (CR) Cathal Mooney: [C:+1] Add back replace ospf to mr.conf [homer/public] - https://gerrit.wikimedia.org/r/1184204 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[08:54:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:57:54] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11142724 (elukey) I am testing the new provision script on other cp2xxx hosts, and I always end up with the following diff when checking if the settings have...
[08:58:45] (CR) FNegri: [C:+1] openstack: add wmcs-server-id (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: Filippo Giunchedi)
[08:59:20] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184481 (https://phabricator.wikimedia.org/T386131) (owner: KartikMistry)
[08:59:57] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11142742 (elukey) Summary of the status: * Test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1173335 to have a final version of the cookbook that...
[09:01:15] (Merged) jenkins-bot: CX section positioning: Fix cxserver requests to include /v2 in the URL [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184481 (https://phabricator.wikimedia.org/T386131) (owner: KartikMistry)
[09:01:43] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184481|CX section positioning: Fix cxserver requests to include /v2 in the URL (T386131)]]
[09:01:46] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[09:04:59] (CR) Filippo Giunchedi: [C:+2] openstack: add wmcs-server-id [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: Filippo Giunchedi)
[09:05:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance
[09:05:54] !log kartik@deploy1003 kartik: Backport for [[gerrit:1184481|CX section positioning: Fix cxserver requests to include /v2 in the URL (T386131)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:05:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T401906)', diff saved to https://phabricator.wikimedia.org/P82450 and previous config saved to /var/cache/conftool/dbconfig/20250903-090556-fceratto.json
[09:06:00] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[09:06:26] (CR) Elukey: [C:+1] esams: rename sandbox range to virtual [puppet] - https://gerrit.wikimedia.org/r/1184471 (https://phabricator.wikimedia.org/T403580) (owner: Ayounsi)
[09:07:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T401906)', diff saved to https://phabricator.wikimedia.org/P82451 and previous config saved to /var/cache/conftool/dbconfig/20250903-090705-fceratto.json
[09:10:11] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.08.16 - 2025.09.05): Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11142825 (BTullis)
[09:10:33] (PS1) David Caro: replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892)
[09:12:35] (CR) Guilherme Gonçalves: [C:+1] "Nice, thank you!"
[alerts] - https://gerrit.wikimedia.org/r/1184128 (https://phabricator.wikimedia.org/T401589) (owner: Bking)
[09:12:49] (CR) CI reject: [V:-1] replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: David Caro)
[09:14:01] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11142842 (phaultfinder)
[09:15:10] (PS2) David Caro: replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892)
[09:15:55] !log kartik@deploy1003 kartik: Continuing with sync
[09:17:32] (CR) CI reject: [V:-1] replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: David Caro)
[09:19:01] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11142864 (phaultfinder)
[09:19:36] (CR) Cathal Mooney: Nokia: /routing-policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[09:20:30] (PS3) David Caro: replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892)
[09:21:09] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184481|CX section positioning: Fix cxserver requests to include /v2 in the URL (T386131)]] (duration: 19m 26s)
[09:21:12] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[09:21:21] (PS4) David Caro: replica_cnf: disable ssl by default on replica.cnf files [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892)
[09:22:13] !log
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P82452 and previous config saved to /var/cache/conftool/dbconfig/20250903-092213-fceratto.json
[09:23:33] btullis@cumin1003 reimage (PID 803876) is awaiting input
[09:24:04] (CR) Ayounsi: [C:+2] esams: rename sandbox range to virtual [puppet] - https://gerrit.wikimedia.org/r/1184471 (https://phabricator.wikimedia.org/T403580) (owner: Ayounsi)
[09:32:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T402925)', diff saved to https://phabricator.wikimedia.org/P82453 and previous config saved to /var/cache/conftool/dbconfig/20250903-093202-ladsgroup.json
[09:32:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[09:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:34:56] (CR) Vgutierrez: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1180814 (owner: Slyngshede)
[09:35:40] I cannot connect to bast3007, it says the host-name is not known? That seems to be new...?
[09:35:53] ssh: Could not resolve hostname bast3007.wikimedia.org: Name or service not known
[09:35:53] Connection closed by UNKNOWN port 65535
[09:36:35] But that's still the server mentioned in https://wikitech.wikimedia.org/wiki/Bastion and https://wikitech.wikimedia.org/wiki/SRE/Production_access#SSH_configuration
[09:37:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P82454 and previous config saved to /var/cache/conftool/dbconfig/20250903-093720-fceratto.json
[09:37:37] MichaelG_WMF: there is a maintenance ongoing on the esams bastion, there was a communication about it last week
[09:37:43] you can use the drmrs one or any other one
[09:38:00] ( bast6003.wikimedia.org for example)
[09:38:21] Comm was to ops-l so it's possible MichaelG_WMF isn't on it?
[09:38:37] yeah, I was checking where it was sent
[09:38:47] (PS1) Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577)
[09:38:55] bast6003 worked for me, thanks!
[09:39:06] I'll look into getting on that list
[09:39:44] (or at least bookmarking a link to its archive so I can check that before asking here next time)
[09:40:00] (CR) CI reject: [V:-1] Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:40:54] ops-codfw, SRE, DC-Ops, serviceops: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11142997 (Clement_Goubert) We're not using that tag, go ahead and remove it.
[09:40:59] everyone with shell access should really be subscribed on that list
[09:41:28] It does not seem to be one of the public ones at https://lists.wikimedia.org/postorius/lists/
[09:41:43] https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/
[09:43:56] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:44:24] taavi thanks! strange that it does not seem to be in the public listing
[09:45:05] subscribed ✅
[09:45:30] i don't think private lists are included in that listing
[09:47:08] (CR) Vgutierrez: "looking good, I've noticed some small inconsistencies on the tests, please check inline comments" [puppet] - https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[09:47:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P82455 and previous config saved to /var/cache/conftool/dbconfig/20250903-094710-ladsgroup.json
[09:47:13] (PS1) Btullis: Fix the partman config for the newly renamed an-worker123[3-6] [puppet] - https://gerrit.wikimedia.org/r/1184489 (https://phabricator.wikimedia.org/T398438)
[09:48:53] (CR) Effie Mouzeli: [C:+1] Add shell and analytics-privatedata-users access for novemlinguae [puppet] - https://gerrit.wikimedia.org/r/1184472 (https://phabricator.wikimedia.org/T403336) (owner: JMeybohm)
[09:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:52:28] !log fceratto@cumin1002 dbctl commit
(dc=all): 'Repooling after maintenance db2152 (T401906)', diff saved to https://phabricator.wikimedia.org/P82456 and previous config saved to /var/cache/conftool/dbconfig/20250903-095228-fceratto.json
[09:52:32] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[09:52:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance
[09:52:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T401906)', diff saved to https://phabricator.wikimedia.org/P82457 and previous config saved to /var/cache/conftool/dbconfig/20250903-095251-fceratto.json
[09:53:01] (PS1) Tiziano Fogli: nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487
[09:53:12] (CR) Btullis: [C:+2] Fix the partman config for the newly renamed an-worker123[3-6] [puppet] - https://gerrit.wikimedia.org/r/1184489 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[09:55:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T401906)', diff saved to https://phabricator.wikimedia.org/P82458 and previous config saved to /var/cache/conftool/dbconfig/20250903-095501-fceratto.json
[09:55:58] (PS1) JMeybohm: Merged multiple MRs: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1184492
[09:56:05] (CR) Ladsgroup: "Hi, we are a bit short on staff (Manuel is out for a while). Would it be okay if this waits a bit?"
[alerts] - https://gerrit.wikimedia.org/r/1184039 (https://phabricator.wikimedia.org/T315866) (owner: Tiziano Fogli)
[09:56:45] (CR) JMeybohm: [V:+2 C:+2] Merged multiple MRs: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1184492 (owner: JMeybohm)
[09:57:23] !log jayme@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002"
[09:57:24] !log jayme@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002
[09:58:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002
[09:58:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002"
[09:59:10] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:59:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1000)
[10:00:24] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie
[10:01:59] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.08.16 - 2025.09.05): Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11143086 (BTullis) Thanks for
checking with us @VRiley-WMF and apologies for the delay in getting back to you. - You can replace this disk at any time co...
[10:02:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P82459 and previous config saved to /var/cache/conftool/dbconfig/20250903-100218-ladsgroup.json
[10:03:33] (CR) Alexandros Kosiaris: [C:+1] Add MariaDB test-s8 section VMs [puppet] - https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: Federico Ceratto)
[10:04:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:05:01] sre-alert-triage, Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143096 (BTullis) a:BTullis
[10:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:07:57] sre-alert-triage, Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143100 (BTullis)
[10:08:43] (CR) Vgutierrez: [C:-1] P:cache::haproxy copy datacenter.mmdb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[10:09:41] sre-alert-triage, Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance
wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143110 (BTullis) I'm attaching T397330 as a parent task, even though it is a different ap...
[10:10:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P82460 and previous config saved to /var/cache/conftool/dbconfig/20250903-101008-fceratto.json
[10:12:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:13:06] (PS2) Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577)
[10:13:30] (CR) JMeybohm: [C:+2] Add shell and analytics-privatedata-users access for novemlinguae [puppet] - https://gerrit.wikimedia.org/r/1184472 (https://phabricator.wikimedia.org/T403336) (owner: JMeybohm)
[10:14:36] (CR) CI reject: [V:-1] Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[10:14:46] (PS1) David Caro: object_storage: alert only for our projects [alerts] - https://gerrit.wikimedia.org/r/1184494
[10:15:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11143127 (JMeybohm) Open→Resolved a:JMeybohm The change has been merged, you should have access in about 30min tops.
[10:17:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T402925)', diff saved to https://phabricator.wikimedia.org/P82461 and previous config saved to /var/cache/conftool/dbconfig/20250903-101725-ladsgroup.json
[10:17:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:17:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[10:17:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T402925)', diff saved to https://phabricator.wikimedia.org/P82462 and previous config saved to /var/cache/conftool/dbconfig/20250903-101749-ladsgroup.json
[10:18:20] (PS3) Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577)
[10:18:24] (CR) Btullis: [C:-1] "I don't think that this will work.
From the docs that you linked:" [puppet] - https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: Ryan Kemper)
[10:19:39] (CR) CI reject: [V:-1] Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[10:22:27] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[10:24:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:24:19] (PS4) Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577)
[10:24:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:25:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P82463 and previous config saved to /var/cache/conftool/dbconfig/20250903-102516-fceratto.json
[10:25:42] (CR) CI reject: [V:-1] Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[10:28:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[10:32:22] (PS5) Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577)
[10:33:37] (PS19)
Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595)
[10:33:38] (PS5) Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595)
[10:33:38] (PS20) Krinkle: varnish: Implement new direct routing for mobile views [puppet] - https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595)
[10:33:38] (PS3) Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[10:33:39] (PS2) Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510)
[10:33:49] (CR) Krinkle: varnish: Implement new direct routing for mobile views (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[10:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[10:34:17] (PS5) Fabfur: team-traffic: removed haproxykafka critical alert [alerts] - https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668)
[10:34:35] (PS1) Slyngshede: P:cache::haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1184497 (https://phabricator.wikimedia.org/T398161)
[10:35:06] (PS1) Gkyziridis: ml-services: Disable autoscaling on edit-check model.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) [10:36:39] (03PS4) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) [10:36:50] (03CR) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [10:37:00] (03CR) 10Krinkle: "FYI: MediaWiki Platform Team is taking ShortUrl sunsetting on as essential work next quarter (starting in October). The only thing left is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153 (https://phabricator.wikimedia.org/T107188) (owner: 10Jforrester) [10:40:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T401906)', diff saved to https://phabricator.wikimedia.org/P82464 and previous config saved to /var/cache/conftool/dbconfig/20250903-104023-fceratto.json [10:40:28] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:40:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:40:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T401906)', diff saved to https://phabricator.wikimedia.org/P82465 and previous config saved to /var/cache/conftool/dbconfig/20250903-104047-fceratto.json [10:42:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T401906)', diff saved to https://phabricator.wikimedia.org/P82466 and previous config saved to /var/cache/conftool/dbconfig/20250903-104257-fceratto.json [10:43:09] (03CR) 10David Caro: object_storage: alert only for our projects (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1184494 (owner: 10David Caro) [10:46:28] (03CR) 
10Slyngshede: [C:03+2] P:cache:haproxy replace semicolons in ISP names [puppet] - 10https://gerrit.wikimedia.org/r/1180814 (owner: 10Slyngshede) [10:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:25] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:51:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:53:57] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:54:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and 208.80.153.216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:54:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:55:24] (03PS6) 10Cathal Mooney: Srl_system: small fixes to make config apply and no diff [homer/public] - 10https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) [10:55:58] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [10:58:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P82467 and 
previous config saved to /var/cache/conftool/dbconfig/20250903-105805-fceratto.json [11:00:00] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182542 (owner: 10PipelineBot) [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1100). [11:01:05] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1234.eqiad.wmnet with OS bullseye [11:01:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1234.eqiad.wmnet with OS bullseye [11:01:44] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182542 (owner: 10PipelineBot) [11:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:05:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:08:02] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:08:22] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:10:08] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:10:35] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:11:36] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143380 (10BTullis) This is a taskmanager pod, not a job manager, so it's probably unrelated... [11:11:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:05] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:12:06] 06SRE, 06Infrastructure-Foundations: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452#11143388 (10LSobanski) [11:12:33] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:13:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P82468 and previous config saved to /var/cache/conftool/dbconfig/20250903-111313-fceratto.json [11:15:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184499 [11:15:58] (03PS1) 10Btullis: Bump the RAM allocated to the rdf-streaming-updater 
taskmanagers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184500 (https://phabricator.wikimedia.org/T402886) [11:16:38] (03CR) 10Brouberol: [C:03+1] Bump the RAM allocated to the rdf-streaming-updater taskmanagers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184500 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [11:16:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:19] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184499 (owner: 10PipelineBot) [11:20:03] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184499 (owner: 10PipelineBot) [11:20:11] (03PS1) 10Daimona Eaytoy: tables-catalog: Document ce_event_contributions (CampaignEvents) [puppet] - 10https://gerrit.wikimedia.org/r/1184501 (https://phabricator.wikimedia.org/T400719) [11:21:33] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:21:59] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:22:25] (03PS2) 10Stevemunene: wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [11:24:19] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:24:38] (03CR) 10Stevemunene: wdqs: (step 2) remove wdqs discovery dns records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [11:24:47] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:25:11] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:25:15] !log btullis@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on an-worker1234.eqiad.wmnet with reason: host reimage [11:25:40] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:28:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T401906)', diff saved to https://phabricator.wikimedia.org/P82469 and previous config saved to /var/cache/conftool/dbconfig/20250903-112820-fceratto.json [11:28:24] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:28:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:28:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T401906)', diff saved to https://phabricator.wikimedia.org/P82470 and previous config saved to /var/cache/conftool/dbconfig/20250903-112842-fceratto.json [11:29:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1234.eqiad.wmnet with reason: host reimage [11:29:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T401906)', diff saved to https://phabricator.wikimedia.org/P82471 and previous config saved to /var/cache/conftool/dbconfig/20250903-112952-fceratto.json [11:30:06] (03PS2) 10David Caro: object_storage: alert only for our projects [alerts] - 10https://gerrit.wikimedia.org/r/1184494 [11:30:11] (03CR) 10David Caro: object_storage: alert only for our projects (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1184494 (owner: 10David Caro) [11:33:11] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:34:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:36:42] (03CR) 10DCausse: [C:04-1] "ccing Gabriele" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184500 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [11:38:12] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11143448 (10BTullis) a:03BTullis These alerts occur when the host is under stress from user activity. It is difficult to stop this... [11:39:32] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:40:39] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05), 13Patch-For-Review: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143454 (10BTullis) [11:40:42] 07sre-alert-triage, 10Wikidata, 06Wikidata-Omega, 10Wikidata-Query-Service, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11143457 (10BTullis) →14Du... 
[11:45:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P82472 and previous config saved to /var/cache/conftool/dbconfig/20250903-114500-fceratto.json [11:46:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1234.eqiad.wmnet with OS bullseye [11:49:20] (03CR) 10Ayounsi: [C:03+1] Srl_system: small fixes to make config apply and no diff [homer/public] - 10https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:53:43] (03CR) 10Klausman: [C:03+1] ml-services: Disable autoscaling on edit-check model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) (owner: 10Gkyziridis) [12:00:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P82473 and previous config saved to /var/cache/conftool/dbconfig/20250903-120007-fceratto.json [12:03:25] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Disable autoscaling on edit-check model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) (owner: 10Gkyziridis) [12:04:14] !log dropping objectcache table in group0 (T397367) [12:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:18] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [12:05:28] (03PS1) 10Ayounsi: esams: remove sandbox filter [homer/public] - 10https://gerrit.wikimedia.org/r/1184507 [12:06:12] (03CR) 10Filippo Giunchedi: [C:03+1] replica_cnf: disable ssl by default on replica.cnf files [puppet] - 10https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: 10David Caro) [12:11:21] (03CR) 10Filippo Giunchedi: [C:03+1] object_storage: alert only for our projects [alerts] - 10https://gerrit.wikimedia.org/r/1184494 (owner: 10David Caro) [12:13:42] (03CR) 10AOkoth: [C:03+1] gitlab: alert on sidekiq queue piling up [alerts] - 10https://gerrit.wikimedia.org/r/1184378 (owner: 10Arnaudb) [12:15:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T401906)', diff saved to https://phabricator.wikimedia.org/P82474 and previous config saved to /var/cache/conftool/dbconfig/20250903-121514-fceratto.json [12:15:20] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:15:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance [12:15:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T401906)', diff saved to https://phabricator.wikimedia.org/P82475 and previous config saved to /var/cache/conftool/dbconfig/20250903-121538-fceratto.json [12:16:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T401906)', diff saved to https://phabricator.wikimedia.org/P82476 and previous 
config saved to /var/cache/conftool/dbconfig/20250903-121648-fceratto.json [12:20:28] (03CR) 10David Caro: [C:03+2] object_storage: alert only for our projects [alerts] - 10https://gerrit.wikimedia.org/r/1184494 (owner: 10David Caro) [12:22:20] (03Merged) 10jenkins-bot: object_storage: alert only for our projects [alerts] - 10https://gerrit.wikimedia.org/r/1184494 (owner: 10David Caro) [12:27:27] (03CR) 10Gmodena: "We have seen similar issues with Flink jobs across the board. I wonder how much of this boils down to memory fragmentation." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184500 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [12:28:10] (03PS1) 10Btullis: Exclude rdf-streaming-updater from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1184509 (https://phabricator.wikimedia.org/T402886) [12:29:07] (03PS1) 10Filippo Giunchedi: interface: create rt_tables.d as needed [puppet] - 10https://gerrit.wikimedia.org/r/1184510 (https://phabricator.wikimedia.org/T401899) [12:29:09] (03PS1) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1184511 (https://phabricator.wikimedia.org/T401899) [12:29:31] (03Abandoned) 10Btullis: Bump the RAM allocated to the rdf-streaming-updater taskmanagers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184500 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [12:30:33] (03CR) 10DCausse: [C:03+1] Exclude rdf-streaming-updater from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1184509 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [12:32:47] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05), 13Patch-For-Review: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143603 (10gmodena) Reposting here for visibility: We have seen 
simil... [12:32:58] (03CR) 10Btullis: [C:03+2] Exclude rdf-streaming-updater from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1184509 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [12:33:49] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:34:09] (03Merged) 10jenkins-bot: Exclude rdf-streaming-updater from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1184509 (https://phabricator.wikimedia.org/T402886) (owner: 10Btullis) [12:35:24] (03CR) 10Jforrester: "Aha, that's cool. Are there tasks tracking your team's planned work here? T107188 doesn't have any blockers but maybe they're elsewhere?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153 (https://phabricator.wikimedia.org/T107188) (owner: 10Jforrester) [12:39:27] (03CR) 10Brouberol: [C:03+1] "Ok, so I think that validates the point we were making in the last sync. We should do this in puppet, then." [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [12:40:03] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org [12:40:04] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:41:39] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05), 13Patch-For-Review: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143665 (10BTullis) We decided to exclude the `rdf-streaming-updater`... 
[12:43:09] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts durum3003.esams.wmnet [12:43:48] (03CR) 10Btullis: [C:03+2] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:43:52] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [12:43:59] (03CR) 10Btullis: [C:03+2] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:44:11] (03CR) 10Btullis: [C:03+2] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:44:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [12:44:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:44:22] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors [12:44:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors [12:44:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:45:09] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts doh3003.wikimedia.org [12:46:33] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:47:10] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:47:24] !log ayounsi@cumin1003 START - Cookbook 
sre.dns.netbox [12:50:55] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403562#11143704 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable [12:50:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:00] (03Merged) 10jenkins-bot: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:51:24] (03PS1) 10Btullis: Add cumin aliases for dse-k8s in both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184512 (https://phabricator.wikimedia.org/T397301) [12:51:38] (03Merged) 10jenkins-bot: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:51:39] (03Merged) 10jenkins-bot: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:51:40] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403561#11143712 (10Jclark-ctr) 05Open→03Declined a:03Jclark-ctr duplicate for T403562 [12:52:04] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [12:52:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM atlas3001.wikimedia.org - ayounsi@cumin1003" [12:52:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [12:52:23] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors [12:52:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors [12:52:27] (03CR) 10Stevemunene: [C:03+1] Add cumin aliases for dse-k8s in both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184512 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [12:52:30] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host atlas3001.wikimedia.org [12:52:48] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [12:53:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T402925)', diff saved to https://phabricator.wikimedia.org/P82477 and previous config saved to /var/cache/conftool/dbconfig/20250903-125456-ladsgroup.json [12:55:01] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:55:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh3003.wikimedia.org [12:55:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11143734 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `doh3003.wikimedia.org` - doh3003.wikimedia.org (**PASS**)... 
[12:56:32] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [12:57:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-worker1233.mgmt:22 - https://phabricator.wikimedia.org/T403569#11143746 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebooted Idrac reachable via gui and serial console [12:57:56] (03CR) 10Btullis: [C:03+2] Add cumin aliases for dse-k8s in both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184512 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [12:59:00] 06SRE, 06cloud-services-team, 06serviceops: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392#11143753 (10fgiunchedi) 05Open→03Invalid The drift between public and private private.git will continue to be an issue until we get serious about secrets managem... [12:59:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:21] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum3003.esams.wmnet [12:59:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11143757 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `durum3003.esams.wmnet` - durum3003.esams.wmnet (**PASS**)... [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1300). nyaa~ [13:00:05] cscott, kart_, and Mvolz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:03] o/ [13:01:09] * Lucas_WMDE nyaa~s back at jouncebot [13:01:14] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-08-26-213211 to 2025-09-03-123051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184517 (https://phabricator.wikimedia.org/T399322) [13:01:15] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-25-145906 to 2025-09-02-205403 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184518 [13:01:15] (03PS1) 10Jforrester: wikifunctions: Set Wikidata caching off in advance, with a 1-minute TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184519 (https://phabricator.wikimedia.org/T397956) [13:01:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11143772 (10ssingh) `durum3003` and `doh3003` decommissioned. [13:02:21] kart_’s change was already deployed; cscott, Mvolz, want to self-service or do you need a deployer? [13:02:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:02:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T401906)', diff saved to https://phabricator.wikimedia.org/P82478 and previous config saved to /var/cache/conftool/dbconfig/20250903-130232-fceratto.json [13:02:35] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:03:24] I can spider pig, should I get started? 
[13:03:56] sure, go ahead [13:04:16] lmk when you're done cscott [13:04:38] (03PS1) 10Btullis: Add the dse-k8s-codfw cluster to the ks8 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) [13:04:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T401906)', diff saved to https://phabricator.wikimedia.org/P82479 and previous config saved to /var/cache/conftool/dbconfig/20250903-130442-fceratto.json [13:05:06] (03CR) 10Stevemunene: [C:03+1] Add the dse-k8s-codfw cluster to the ks8 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:05:49] (03PS2) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) [13:05:56] (03CR) 10Brouberol: "I think you only need to add the cluster to `ALLOWED_CUMIN_ALIASES` (same file)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:06:03] Lucas_WMDE: oops. I forgot to remove my patch. [13:06:54] np ^^ [13:07:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11143827 (10Novem_Linguae) Tested, works. Thank you very much. [13:07:49] (03PS3) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) [13:07:51] 06SRE, 06DC-Ops, 06serviceops: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11143828 (10Jhancock.wm) [13:08:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/ReadingLists] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184185 (owner: 10C. 
Scott Ananian) [13:08:43] Mvolz: ok, will let you know [13:10:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P82480 and previous config saved to /var/cache/conftool/dbconfig/20250903-131004-ladsgroup.json [13:11:16] (03Merged) 10jenkins-bot: Replace ParamType with ListType [extensions/ReadingLists] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184185 (owner: 10C. Scott Ananian) [13:11:43] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1184185|Replace ParamType with ListType]] [13:12:06] (03CR) 10Brouberol: Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:12:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11143867 (10Jhancock.wm) [13:15:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11143881 (10Jhancock.wm) @jasmine_ how do you feel about the server going in row D? doesn't look like we have one in that row. 
[13:16:29] (03CR) 10Bking: opensearch-operator: Add chart for review (2/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:17:05] (03CR) 10Brouberol: Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:17:18] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11143890 (10Jhancock.wm) [13:18:06] (03CR) 10Vgutierrez: varnish: Remove 60s cap for mobileaction/useformat on m-dot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [13:18:31] !log cscott@deploy1003 cscott: Backport for [[gerrit:1184185|Replace ParamType with ListType]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:19:06] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11143899 (10phaultfinder) [13:19:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P82481 and previous config saved to /var/cache/conftool/dbconfig/20250903-131950-fceratto.json [13:20:20] (03CR) 10Bking: [C:03+2] stat hosts: alert on I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1184128 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [13:21:29] !log cscott@deploy1003 cscott: Continuing with sync [13:21:34] tested looks good [13:22:45] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:24:07] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11143907 (10phaultfinder) [13:25:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P82482 and previous config saved to /var/cache/conftool/dbconfig/20250903-132512-ladsgroup.json [13:26:38] (03PS4) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) [13:26:49] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184185|Replace ParamType with ListType]] (duration: 15m 06s) [13:27:10] Mvolz: ok, over to you [13:27:18] 06SRE, 10Cloud-Services, 13Patch-For-Review: Backport sshd with AuthorizedKeysCommand support to Ubuntu precise - https://phabricator.wikimedia.org/T102401#11143917 (10fgiunchedi) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org... [13:27:20] thanks! 
[13:27:36] (03PS5) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) [13:27:45] FIRING: [10x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:28:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mvolz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180103 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:29:24] (03CR) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:29:36] (03CR) 10Vgutierrez: [C:03+1] varnish: Implement new direct routing for mobile views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [13:29:46] (03Merged) 10jenkins-bot: Remove all references to deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180103 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:29:46] Starting backport. First time using spiderpig so hopefully it goes smoothly :) [13:30:11] !log mvolz@deploy1003 Started scap sync-world: Backport for [[gerrit:1180103|Remove all references to deprecated parameter (T361576)]] [13:30:14] T361576: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576 [13:30:17] good luck! 
[13:31:23] (03CR) 10Btullis: Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:31:33] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 1 AdminDown: 4 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:31:47] ty! [13:33:07] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11143975 (10Jhancock.wm) for all servers but the first, just going to wait on decoms to bring the server count down. the first one is being addressed by papaul and networks team. will leav... [13:33:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11143974 (10tappof) [13:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P82483 and previous config saved to /var/cache/conftool/dbconfig/20250903-133457-fceratto.json [13:35:50] !log upgrading envoyproxy to 1.26.8-1, restbase/eqiad (cassandra) rack 'a' — T402584 [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:53] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [13:36:45] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance 
wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11143980 (10BTullis) 05Open→03Resolved [13:36:49] !log mvolz@deploy1003 mvolz: Backport for [[gerrit:1180103|Remove all references to deprecated parameter (T361576)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:36:52] T361576: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576 [13:40:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T402925)', diff saved to https://phabricator.wikimedia.org/P82484 and previous config saved to /var/cache/conftool/dbconfig/20250903-134019-ladsgroup.json [13:40:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:40:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [13:40:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T402925)', diff saved to https://phabricator.wikimedia.org/P82485 and previous config saved to /var/cache/conftool/dbconfig/20250903-134043-ladsgroup.json [13:41:49] Everything looks good on the test servers, continuing with sync [13:42:03] !log mvolz@deploy1003 mvolz: Continuing with sync [13:42:43] (03PS3) 10Ayounsi: Nokia: /routing-policy [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 [13:42:45] FIRING: [10x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:45:16] !log upgrading envoyproxy to 1.26.8-1, restbase/eqiad (cassandra) rack 'b' — T402584 [13:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:20] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [13:45:49] !log 
dropping all unused tables of securepoll in s3 (T395928) [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:52] T395928: On wikis where user right securepoll-create-poll is missing, delete non-essential SecurePoll SQL tables - https://phabricator.wikimedia.org/T395928 [13:47:24] !log mvolz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180103|Remove all references to deprecated parameter (T361576)]] (duration: 17m 13s) [13:47:28] T361576: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576 [13:47:45] RESOLVED: [8x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:48:15] \o/ [13:48:37] \o/ [13:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:01] (03CR) 10Elukey: [C:03+1] ml-services: Disable autoscaling on edit-check model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) (owner: 10Gkyziridis) [13:49:14] !log UTC afternoon backport+config window done [13:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:36] !log upgrading envoyproxy to 1.26.8-1, restbase/eqiad (cassandra) rack 'd' — T402584 [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T401906)', diff saved to https://phabricator.wikimedia.org/P82487 and previous config saved to /var/cache/conftool/dbconfig/20250903-135005-fceratto.json [13:50:08] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:50:14] (03PS16) 10Bking: opensearch-operator: Add chart for review (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [13:50:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:50:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T401906)', diff saved to https://phabricator.wikimedia.org/P82488 and previous config saved to /var/cache/conftool/dbconfig/20250903-135028-fceratto.json [13:51:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, there are just some TODOs that caught my attention as the comment says it'll change once in prod, should they have those values alre" [puppet] - 10https://gerrit.wikimedia.org/r/1184487 (owner: 10Tiziano Fogli) [13:51:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T401906)', diff saved to https://phabricator.wikimedia.org/P82489 and previous config saved to /var/cache/conftool/dbconfig/20250903-135139-fceratto.json [13:53:47] (03CR) 10Elukey: [C:03+1] Add the dse-k8s-codfw 
cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [13:54:24] (03PS6) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [13:54:47] (03CR) 10Bking: opensearch-operator: Add chart for review (2/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:55:10] (03CR) 10Ayounsi: Nokia: /routing-policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [13:55:12] (03CR) 10Krinkle: varnish: Implement new direct routing for mobile views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [13:55:44] (03PS1) 10Kosta Harlan: hCaptcha: Update logging [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184528 [13:55:50] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:56:23] jouncebot: nowandnext [13:56:24] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1300) [13:56:24] In 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1400) [13:56:42] kostajh: Should be OK to deploy an MW thing in our window, we're services-only today. [13:56:55] James_F: thanks! 
[13:57:18] (03PS1) 10Elukey: profile::gpu: improvements for new ml k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184529 [13:57:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184528 (owner: 10Kosta Harlan) [13:58:01] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-08-26-213211 to 2025-09-03-123051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184517 (https://phabricator.wikimedia.org/T399322) (owner: 10Jforrester) [13:59:16] (03Abandoned) 10Elukey: profile::gpu: improvements for new ml k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184529 (owner: 10Elukey) [13:59:33] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS trixie [13:59:38] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-08-26-213211 to 2025-09-03-123051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184517 (https://phabricator.wikimedia.org/T399322) (owner: 10Jforrester) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1400) [14:00:23] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:01:13] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:02:14] (03PS1) 10Elukey: Revert "Add a new insetup role for ml-k8s hosts to test their GPU" [puppet] - 10https://gerrit.wikimedia.org/r/1184533 [14:02:53] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:59] this is me --^ [14:03:22] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:03:49] RESOLVED: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:04:14] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:04:21] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:09] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [14:05:16] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:01] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-08-25-145906 to 2025-09-02-205403 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184518 (owner: 10Jforrester) [14:06:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P82490 and previous config saved to /var/cache/conftool/dbconfig/20250903-140646-fceratto.json [14:07:33] (03CR) 10Dzahn: [C:03+1] gitlab: alert on sidekiq queue piling up [alerts] - 10https://gerrit.wikimedia.org/r/1184378 (owner: 10Arnaudb) [14:07:59] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-08-25-145906 to 2025-09-02-205403 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184518 (owner: 10Jforrester) [14:08:12] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:31] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:08:45] (03CR) 10Elukey: [C:03+2] Revert "Add a new insetup role for ml-k8s hosts to test their GPU" [puppet] - 10https://gerrit.wikimedia.org/r/1184533 (owner: 10Elukey) [14:08:56] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:09:01] (03CR) 10Brouberol: "Nicely done!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [14:09:23] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:09:25] (03CR) 10Brouberol: [C:04-1] Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [14:10:05] (03CR) 10Gkyziridis: [C:03+2] ml-services: Disable autoscaling on edit-check model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) (owner: 10Gkyziridis) [14:10:27] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:41] (03Merged) 10jenkins-bot: hCaptcha: Update logging [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184528 (owner: 10Kosta Harlan) [14:10:51] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:11:09] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1184528|hCaptcha: Update logging]] [14:11:46] (03CR) 10Jforrester: [C:03+2] wikifunctions: Set Wikidata caching off in advance, with a 1-minute TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184519 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:12:00] (03Merged) 10jenkins-bot: ml-services: Disable autoscaling on edit-check model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184498 (https://phabricator.wikimedia.org/T403378) (owner: 10Gkyziridis) [14:13:17] (03CR) 10Dzahn: scap::master: Add /srv/patches git pre-commit hook for permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [14:13:18] (03CR) 10Dzahn: [C:03+2] scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [14:13:42] (03Merged) 10jenkins-bot: wikifunctions: Set Wikidata caching off in advance, with a 1-minute TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184519 (https://phabricator.wikimedia.org/T397956) (owner: 10Jforrester) [14:14:00] jouncebot: now [14:14:00] For the next 0 hour(s) and 45 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1400) [14:14:19] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:11] (03PS1) 10Scott French: changeprop-jobqueue: add CategoryCountUpdateJob rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184162 (https://phabricator.wikimedia.org/T402873) [14:15:20] mutante: We're pretty much done. I think kostajh is done too? [14:15:36] I’m not done yet [14:15:38] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1184528|hCaptcha: Update logging]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:16:30] !log kharlan@deploy1003 kharlan: Continuing with sync [14:16:49] James_F: just merged something that does "Add a git pre-commit hook to /srv/patches" [14:17:38] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:17:56] Oh no, not another damn well-meaning git hook that breaks my flow. 
[14:18:14] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:28] "You have unstaged changes". Correct, I'm doing several things at once, because I know what I'm doing, please piss off, git. [14:19:19] (03PS1) 10Abijeet Patro: TranslationUnitDTO: Make blob type properties writable [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184538 (https://phabricator.wikimedia.org/T402520) [14:19:20] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:19:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184538 (https://phabricator.wikimedia.org/T402520) (owner: 10Abijeet Patro) [14:20:03] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:20:40] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:21:13] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:21:41] this one should only mess with your flow if it involves doing stuff as root :p [14:21:52] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184528|hCaptcha: Update logging]] (duration: 10m 43s) [14:21:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P82491 and previous config saved to /var/cache/conftool/dbconfig/20250903-142154-fceratto.json [14:23:17] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:24:46] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[14:24:55] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:25:13] 06SRE, 06cloud-services-team, 06serviceops: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392#11144118 (10Dzahn) It was just about a social contract to not forget adding fake secrets when adding real secrets. [14:27:36] (03PS6) 10Brouberol: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [14:27:59] (03CR) 10Brouberol: [C:03+1] Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [14:28:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1430) [14:30:22] mutante: Ack. 
:-) [14:33:31] (03CR) 10Vgutierrez: [C:03+1] varnish: Implement new direct routing for mobile views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:34:36] (03CR) 10Krinkle: varnish: Implement new direct routing for mobile views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:34:46] (03CR) 10Papaul: [C:03+2] Remove OSFP on mr1-ulsfo, cr3 and cr4 ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1184202 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [14:34:53] (03CR) 10Papaul: [C:03+2] Add back replace ospf to mr.conf [homer/public] - 10https://gerrit.wikimedia.org/r/1184204 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [14:35:12] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T401906)', diff saved to https://phabricator.wikimedia.org/P82493 and previous config saved to /var/cache/conftool/dbconfig/20250903-143701-fceratto.json [14:37:05] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:37:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:37:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P82494 and previous config saved to /var/cache/conftool/dbconfig/20250903-143724-fceratto.json [14:39:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T401906)', diff saved to https://phabricator.wikimedia.org/P82495 and previous config saved to /var/cache/conftool/dbconfig/20250903-143934-fceratto.json [14:41:00] jouncebot: nowandnext [14:41:00] For the next 0 hour(s) and 18 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1400) [14:41:00] For the next 0 hour(s) and 18 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1430) [14:41:01] In 2 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1700) [14:43:03] (03PS1) 10Scott French: Revert "hiddenparma: temporarily disable some admission policies" [puppet] - 10https://gerrit.wikimedia.org/r/1184542 [14:45:36] 06SRE, 06Traffic-Icebox: Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11144182 (10Ottomata) [14:47:16] (03CR) 10Scott French: [C:03+2] Revert "hiddenparma: temporarily disable some admission policies" [puppet] - 10https://gerrit.wikimedia.org/r/1184542 (owner: 10Scott French) [14:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:24] (03PS1) 10Dreamy Jazz: Instrument CentralAuthUser::getBlocks [extensions/CentralAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184546 (https://phabricator.wikimedia.org/T401701) [14:50:38] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
ml-serve1012.eqiad.wmnet with reason: host reimage [14:50:54] James_F: Are you done with the Wikifunctions window? [14:51:06] I'd like to ad-hoc backport [14:52:08] (03PS1) 10Ahmon Dancy: modules/scap/templates/patches-pre-commit-hook.erb: Remove tabs [puppet] - 10https://gerrit.wikimedia.org/r/1184548 [14:52:38] (03PS2) 10Ahmon Dancy: modules/scap/templates/patches-pre-commit-hook.erb: Remove tabs [puppet] - 10https://gerrit.wikimedia.org/r/1184548 [14:54:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [14:54:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P82496 and previous config saved to /var/cache/conftool/dbconfig/20250903-145441-fceratto.json [14:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:00:03] (03PS1) 10Phuedx: MetricsPlatform: Enable overrides everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184549 (https://phabricator.wikimedia.org/T402369) [15:00:54] !log upgrading envoyproxy to 1.26.8-1, restbase/codfw — T402584 [15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:57] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [15:01:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184546 (https://phabricator.wikimedia.org/T401701) (owner: 10Dreamy Jazz) [15:02:51] (03CR) 10Btullis: [C:03+2] Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 
10Btullis) [15:03:04] (03Merged) 10jenkins-bot: Instrument CentralAuthUser::getBlocks [extensions/CentralAuth] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184546 (https://phabricator.wikimedia.org/T401701) (owner: 10Dreamy Jazz) [15:03:04] (03CR) 10Btullis: [C:03+2] Add the dse-k8s-codfw cluster to the k8s cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [15:03:30] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1184546|Instrument CentralAuthUser::getBlocks (T401701)]] [15:03:33] T401701: UserInfoCard: Queries performed by `CentralAuthUser::getBlocks` is uncached and performs lots of queries - https://phabricator.wikimedia.org/T401701 [15:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:35] (03PS6) 10Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) [15:04:39] (03CR) 10Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [15:04:49] (03PS21) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [15:05:15] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: add CategoryCountUpdateJob rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184162 (https://phabricator.wikimedia.org/T402873) (owner: 10Scott French) [15:08:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - 
https://phabricator.wikimedia.org/T392851#11144263 (10Papaul) For reference please see case number below ` Dear Papaul , Thank you for contacting Dell Technologies technical support, from this moment... [15:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:23] (03Merged) 10jenkins-bot: Add the dse-k8s-codfw cluster to the k8s cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1184520 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [15:09:42] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie [15:09:48] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1184546|Instrument CentralAuthUser::getBlocks (T401701)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:09:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P82497 and previous config saved to /var/cache/conftool/dbconfig/20250903-150949-fceratto.json [15:09:51] T401701: UserInfoCard: Queries performed by `CentralAuthUser::getBlocks` is uncached and performs lots of queries - https://phabricator.wikimedia.org/T401701 [15:10:11] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [15:10:35] (03CR) 10Scott French: [C:03+2] cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy) [15:10:47] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Revision-deletion, and 2 others: Revision deletion on image files is excessively slow - https://phabricator.wikimedia.org/T403572#11144269 (10Pppery) [15:13:25] (03Merged) 10jenkins-bot: cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy) [15:13:29] (03PS1) 10Federico Ceratto: mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [15:13:29] (03CR) 10Federico Ceratto: "As discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:13:39] (03CR) 10Scott French: [C:03+2] tox.ini: Pass --diff to black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 (owner: 10Ahmon Dancy) [15:13:56] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:14:22] PROBLEM - Host cr3-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:22] PROBLEM - Host cr4-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:36] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:36] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:36] that is me [15:14:38] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:14:38] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:15:12] (03PS1) 10Bking: dse-k8s-eqiad: Add ipoid-opensearch namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) [15:15:33] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184546|Instrument CentralAuthUser::getBlocks (T401701)]] (duration: 12m 03s) [15:15:36] T401701: UserInfoCard: Queries performed by `CentralAuthUser::getBlocks` is uncached and performs lots of queries - https://phabricator.wikimedia.org/T401701 [15:15:40] thanks, papaul, ack [15:15:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:15:55] (03PS2) 10Federico Ceratto: mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [15:16:10] Dreamy_Jazz: do you have anything lined up after your backport that just wrapped up? 
[15:16:15] jouncebot: nowandnext [15:16:15] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [15:16:15] In 1 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1700) [15:16:25] (03Merged) 10jenkins-bot: tox.ini: Pass --diff to black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 (owner: 10Ahmon Dancy) [15:16:37] (03CR) 10CI reject: [V:04-1] mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:17:03] (03CR) 10BryanDavis: "Cause of T403616 in Beta Cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [15:17:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:14] PROBLEM - Host scs-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:17:47] (03PS1) 10Dzahn: site: add peopleweb role to new peopleweb hosts again [puppet] - 10https://gerrit.wikimedia.org/r/1184553 (https://phabricator.wikimedia.org/T403526) [15:18:43] (03PS3) 10Federico Ceratto: mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [15:18:56] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:56] (03PS2) 10Bking: dse-k8s-eqiad: Add ipoid-opensearch namespaces [puppet] 
- 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) [15:19:20] (03CR) 10Scott French: [C:03+2] cli.py: Improve UX when config file does not exist [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072 (owner: 10Ahmon Dancy) [15:19:32] (03PS2) 10Dzahn: site: add peopleweb role to new peopleweb hosts again [puppet] - 10https://gerrit.wikimedia.org/r/1184553 (https://phabricator.wikimedia.org/T403526) [15:20:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:21:06] (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] TranslationUnitDTO: Make blob type properties writable [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184538 (https://phabricator.wikimedia.org/T402520) (owner: 10Abijeet Patro) [15:21:23] (03CR) 10Dzahn: [C:03+2] site: add peopleweb role to new peopleweb hosts again [puppet] - 10https://gerrit.wikimedia.org/r/1184553 (https://phabricator.wikimedia.org/T403526) (owner: 10Dzahn) [15:22:16] (03Merged) 10jenkins-bot: cli.py: Improve UX when config file does not exist [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072 (owner: 10Ahmon Dancy) [15:22:30] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.81 ms [15:22:32] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.05 ms [15:22:32] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.09 ms [15:22:36] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:22:36] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:23:56] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:21] (03PS1) 10Bking: dse-k8s-eqiad: Add ipoid-opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) [15:24:36] RECOVERY - Host cr3-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.46 ms [15:24:36] RECOVERY - Host cr4-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.34 ms [15:24:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T401906)', diff saved to https://phabricator.wikimedia.org/P82498 and previous config saved to /var/cache/conftool/dbconfig/20250903-152457-fceratto.json [15:25:01] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:25:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance [15:25:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T401906)', diff saved to https://phabricator.wikimedia.org/P82499 and previous config saved to /var/cache/conftool/dbconfig/20250903-152519-fceratto.json [15:25:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184160 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [15:27:28] RECOVERY - Host scs-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.90 ms [15:27:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T401906)', diff saved to https://phabricator.wikimedia.org/P82500 and previous config saved to /var/cache/conftool/dbconfig/20250903-152729-fceratto.json 
[15:29:36] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 4/4 UP : 3 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:33] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:30:59] PROBLEM - Host cr3-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:59] PROBLEM - Host cr4-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:02] FYI, unless there are any objections, I'll be deploying a change to changeprop-jobqueue in a few minutes to add a new job processing rule for T402873 [15:31:03] T402873: Create dedicated changeprop-jobqueue rule for CategoryCountUpdateJob - https://phabricator.wikimedia.org/T402873 [15:31:21] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:31:21] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:31:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 6/6 UP : 5 v2 P2P interfaces vs. 6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:57] (PS4) Federico Ceratto: mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [15:32:10] (CR) Scott French: "Thank you both for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184162 (https://phabricator.wikimedia.org/T402873) (owner: 10Scott French) [15:32:11] (03CR) 10Brouberol: dse-k8s-eqiad: Add ipoid-opensearch namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:32:34] (03CR) 10Brouberol: dse-k8s-eqiad: Add ipoid-opensearch namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:32:41] (03CR) 10CI reject: [V:04-1] mysqld_exporter.pp: fix /var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:33:11] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: add CategoryCountUpdateJob rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184162 (https://phabricator.wikimedia.org/T402873) (owner: 10Scott French) [15:33:51] PROBLEM - Host scs-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:33:56] RESOLVED: [3x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:04] (03Merged) 10jenkins-bot: changeprop-jobqueue: add CategoryCountUpdateJob rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184162 (https://phabricator.wikimedia.org/T402873) (owner: 10Scott French) [15:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:36:00] (03PS5) 10Federico Ceratto: mysqld_exporter.pp: fix 
/var/log/prometheus perms [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [15:36:59] changeprop-jobqueue updates starting momentarily [15:37:16] (03Merged) 10jenkins-bot: TranslationUnitDTO: Make blob type properties writable [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184538 (https://phabricator.wikimedia.org/T402520) (owner: 10Abijeet Patro) [15:37:31] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:38:01] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:38:09] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11144417 (10Jhancock.wm) it didn't save the password change for whatever reason. you should be all set for real this time =) [15:38:37] !log ariel@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=no.wikipedia.org --logwiki=metawiki DSBinfo Nordlysoversola # T403581 [15:38:40] T403581: Unblock stuck global rename of Nordlysoversola - https://phabricator.wikimedia.org/T403581 [15:38:56] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:10] !log ariel@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=nowiki --logwiki=metawiki DSBinfo Nordlysoversola # T403581 [15:39:45] rzl: first time testing envoy on trixie. using an existing puppetized role. but somehow fails to start with "/etc/envoy/envoy.yaml': Unable to convert YAML as JSON" and that file is empty.. surprising.. 
but trying to debug some more why [15:42:13] PROBLEM - people.wikimedia.org requires authentication on people1005 is CRITICAL: connect to address 10.64.32.95 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:42:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P82501 and previous config saved to /var/cache/conftool/dbconfig/20250903-154237-fceratto.json [15:42:47] this alert is me. WIP and silencing [15:43:21] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:43:21] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.08 ms [15:43:21] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.39 ms [15:43:23] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.75 ms [15:43:33] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on people1005.eqiad.wmnet with reason: debugging [15:43:37] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:37] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:44:05] RECOVERY - Host scs-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms [15:44:15] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:44:58] 10ops-codfw, 06DC-Ops: codfw: document SCS ports in Netbox - https://phabricator.wikimedia.org/T403634 (10ayounsi) 03NEW [15:46:26] RECOVERY - Host cr4-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.43 ms [15:46:26] RECOVERY - Host cr3-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.42 ms [15:47:26] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 
people2004.codfw.wmnet with reason: debugging [15:48:19] rzl: you know what fixed it? rm -rf /etc/envoy/ and running puppet :) I had a vague memory that it happened to me in the past due to some race. [15:48:54] so the config file has content now and ..alright [15:48:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:49:46] mutante: sorry just out of a meeting -- weird, glad that worked though [15:49:56] thanks for the callout, that's good to know [15:50:09] rzl: it was not meant to be time-sensitive at all :) [15:50:28] 👍 [15:50:31] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11144499 (10Papaul) manually disable OSPF (using commit confirmed) make the mgmt goes down when done on mr1 or cr3/cr4 . But mr1-ulsfo.oob.wikimedia.org and mr1 loopback stil... [15:52:42] (03CR) 10Btullis: dse-k8s-eqiad: Add ipoid-opensearch namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:53:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144512 (10Papaul) Here is the email from Dell ` Dear Papaul, Thank you for your patience while we investigated the issue regarding the Virtualization Techno... 
[15:53:56] robh: elukey: update task with email from Dell [15:54:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:55:34] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:57:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P82502 and previous config saved to /var/cache/conftool/dbconfig/20250903-155744-fceratto.json [16:03:57] FYI, I'm done with my changeprop-jobqueue changes [16:07:11] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.08.16 - 2025.09.05): Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11144543 (VRiley-WMF) Open→Resolved This disk has been replaced. [16:11:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for people1005.eqiad.wmnet [16:11:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for people1005.eqiad.wmnet [16:11:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for people2004.codfw.wmnet [16:11:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for people2004.codfw.wmnet [16:12:20] PROBLEM - people.wikimedia.org requires authentication on people2004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:12:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T401906)', diff saved to https://phabricator.wikimedia.org/P82503 and previous config saved to /var/cache/conftool/dbconfig/20250903-161252-fceratto.json [16:12:56] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:12:58] !log people1005 - systemctl start wmf_auto_restart_envoyproxy.service [16:12:59] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [16:17:27] 10ops-codfw, 06DC-Ops: codfw: document SCS ports in Netbox - https://phabricator.wikimedia.org/T403634#11144590 (10Papaul) a:03Papaul [16:19:39] (03CR) 10Elukey: [C:03+1] opensearch-operator: Add chart for review (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:21:12] (03CR) 10Bking: dse-k8s-eqiad: Add ipoid-opensearch namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [16:22:27] (03PS1) 10Jdlrobson: Cleanup special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184559 (https://phabricator.wikimedia.org/T400066) [16:22:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T402925)', diff saved to https://phabricator.wikimedia.org/P82504 and previous config saved to /var/cache/conftool/dbconfig/20250903-162237-ladsgroup.json [16:22:42] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:23:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144628 (10RobH) [16:27:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144668 (10RobH) [16:29:48] jouncebot: nowandnext [16:29:48] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [16:29:49] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1700) [16:29:59] 06SRE, 06Data-Engineering, 
06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638 (10MSantos) 03NEW [16:30:04] (03PS7) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) [16:30:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [16:31:17] (03Merged) 10jenkins-bot: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [16:32:53] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638#11144734 (10MSantos) @Krinkle after a conversation with @NBaca-WMF we realised that the affected teams might want to do their own QA analysis and this will... [16:33:47] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638#11144736 (10MSantos) a:05Krinkle→03None [16:34:32] uuu [16:34:48] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638#11144742 (10Krinkle) We're deploying next week. There is no further need for dedicated testing afaik. We've prepared this back in March already. The rollou... 
[16:34:49] not cool https://www.irccloud.com/pastebin/xWUjLBWF/ [16:35:46] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1180969'": T401595 [16:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:49] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [16:36:46] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638#11144746 (10Krinkle) Is there a specific feature you're worried about, and does that feature not have test coverage or QA regression testing on a regular b... [16:36:49] (03PS1) 10Urbanecm: Revert "TranslationUnitDTO: Make blob type properties writable" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184561 (https://phabricator.wikimedia.org/T402520) [16:36:57] (03CR) 10Urbanecm: [V:03+2 C:03+2] "to unblock deployment" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184561 (https://phabricator.wikimedia.org/T402520) (owner: 10Urbanecm) [16:37:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P82505 and previous config saved to /var/cache/conftool/dbconfig/20250903-163745-ladsgroup.json [16:38:11] (03CR) 10Ssingh: [C:03+2] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:40:13] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1179648|[Growth] enwiki: Deploy "Add a link" to 100% of users (T395524)]] [16:40:16] T395524: Add a link (Structured task): Increase rollout on English Wikipedia to 100% - https://phabricator.wikimedia.org/T395524 [16:40:43] (03CR) 10Jcrespo: "Shouldn't the process that requires access be added to the 
group, rather than adding free permissions to everyone?" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [16:41:05] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA the removal of m-dot subdomain - https://phabricator.wikimedia.org/T403638#11144762 (10MSantos) >>! In T403638#11144746, @Krinkle wrote: > Is there a specific feature you're worried about, and does that feature not have test cover... [16:41:46] (03CR) 10Jcrespo: "I say because: E.g. what if in the future, private mysql queries are stored there?" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [16:43:41] !log merging CR 1183212: T401595 [16:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:44] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [16:44:13] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11144781 (10Krinkle) [16:44:34] (03CR) 10Ssingh: [C:03+1] "Comment addressed and in place, reviewed." [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:44:36] (03CR) 10Ssingh: [C:03+2] varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:44:36] !log urbanecm@deploy1003 urbanecm, cyndywikime: Backport for [[gerrit:1179648|[Growth] enwiki: Deploy "Add a link" to 100% of users (T395524)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:44:55] !log urbanecm@deploy1003 urbanecm, cyndywikime: Continuing with sync [16:45:06] RECOVERY - MegaRAID on an-worker1128 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:48:29] (03CR) 10Ssingh: [C:03+2] varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:49:00] (03PS22) 10Ssingh: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:49:42] (03CR) 10Ssingh: [C:03+2] varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:49:54] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179648|[Growth] enwiki: Deploy "Add a link" to 100% of users (T395524)]] (duration: 09m 40s) [16:49:57] T395524: Add a link (Structured task): Increase rollout on English Wikipedia to 100% - https://phabricator.wikimedia.org/T395524 [16:51:25] (03PS1) 10Jdlrobson: WIP: Deploy dark mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) [16:52:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P82506 and previous config saved to /var/cache/conftool/dbconfig/20250903-165253-ladsgroup.json [16:56:40] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:58:52] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11144837 (10Krinkle) >>! 
In T403638#11144733, @MSantos wrote: > However, to kickstart this work stream they need to know 2 things: > - What exactly needs... [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1700) [17:01:04] (03CR) 10Bking: [C:03+2] Introduce opensearch-operator-crds chart (1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:01:38] (03CR) 10Bking: [C:03+2] "self-merging, as the approval is implied by the +1 in the second patch in the chain" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:01:42] !log rolling out CRs 1180969, 1183212, 1180577, -b31 A:cp: T401595 [17:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:45] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [17:02:23] (03Merged) 10jenkins-bot: Introduce opensearch-operator-crds chart (1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:04:11] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:04:23] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:06:24] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11144863 (10NBaca-WMF) Hm, I'm not sure that general-purpose QA flows will necessarily uncover potential breaking changes here. I also think teams and tea... 
[17:06:26] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144864 (RobH) [17:07:14] (CR) David Caro: "Tested in toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/1184484 (https://phabricator.wikimedia.org/T182892) (owner: David Caro) [17:08:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T402925)', diff saved to https://phabricator.wikimedia.org/P82507 and previous config saved to /var/cache/conftool/dbconfig/20250903-170800-ladsgroup.json [17:08:04] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:08:16] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:09:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:11:08] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host maps1011.eqiad.wmnet with OS bookworm [17:11:18] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11144882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host maps1011.eqiad.wmnet with OS bookworm [17:12:54] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:16:38] vriley@cumin1003 provision (PID 870302) is awaiting input [17:17:03] ops-codfw, SRE, DC-Ops: hw troubleshooting: SSD Firmware update for frbackup2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T394359#11144890 (Jhancock.wm) →Duplicate dup:T396649 [17:17:05] ops-codfw, SRE, 
06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#11144892 (10Jhancock.wm) [17:17:10] 10ops-codfw, 06SRE, 06DC-Ops: hw troubleshooting: SSD Firmware update for frbackup2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T394359#11144896 (10Jhancock.wm) completed in T396649 back in June [17:17:37] (03PS1) 10Herron: thanos-store: set cutoff days to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1184566 (https://phabricator.wikimedia.org/T349521) [17:18:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:19:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:25:04] 06SRE, 13Patch-For-Review: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11144942 (10bd808) [17:29:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:30:44] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [17:33:10] (03PS1) 10Bking: dse-k8s: Introduce opensearch-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) [17:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:35:44] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1011 - vriley@cumin1003" [17:35:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11144994 (10RobH) cp2045 has had the idrac, bios, and SSD firmware updated to latest revisions to match cp2043. Please note that these have been brought online... [17:35:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1011 - vriley@cumin1003" [17:35:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:30] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps1014 [17:37:05] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? 
- https://phabricator.wikimedia.org/T389333#11144999 (10ssingh) a:03CDobbins [17:37:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps1014 [17:39:10] (03CR) 10Mstyles: [C:03+1] OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [17:39:32] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:39:32] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1011.eqiad.wmnet with reason: host reimage [17:41:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [17:43:26] (03PS4) 10Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) [17:44:04] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1011.eqiad.wmnet with reason: host reimage [17:45:08] (03PS5) 10Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) [17:45:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145046 (10VRiley-WMF) [17:46:56] (03CR) 10Cathal Mooney: [C:03+2] Srl_system: small fixes to make config apply and no diff [homer/public] - 10https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:47:31] (03CR) 10Dzahn: [C:03+2] modules/scap/templates/patches-pre-commit-hook.erb: Remove tabs 
[puppet] - 10https://gerrit.wikimedia.org/r/1184548 (owner: 10Ahmon Dancy) [17:48:15] (03Merged) 10jenkins-bot: Srl_system: small fixes to make config apply and no diff [homer/public] - 10https://gerrit.wikimedia.org/r/1184486 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:48:16] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:48:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11145066 (10ssingh) >>! In T392851#11144512, @Papaul wrote: > Here is the email from Dell > ` > Dear Papaul, > > Thank you for your patience while we investig... [17:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:52:55] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:52:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [17:53:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [17:54:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [17:54:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:56:23] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host maps1012.eqiad.wmnet with OS bookworm [17:56:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145097 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host maps1012.eqiad.wmnet with OS bookworm [17:59:36] vriley@cumin1003 provision (PID 873411) is awaiting input [18:00:05] dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T1800). 
[18:00:27] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic-Icebox, 07Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817#11145138 (10Ottomata) [18:00:28] 06SRE, 06Traffic-Icebox: Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11145139 (10Ottomata) [18:00:35] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: rename mw-php-migration to mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1154900 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:03:09] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:04:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:04:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps1011.eqiad.wmnet with OS bookworm [18:04:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host maps1011.eqiad.wmnet with OS bookworm completed: - maps1011 (**PASS**) - Rem... 
[18:06:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:07:31] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184570 (https://phabricator.wikimedia.org/T396378) [18:07:33] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184570 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:07:40] (03CR) 10Brouberol: [C:03+1] dse-k8s: Introduce opensearch-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [18:07:47] (03CR) 10Alexandros Kosiaris: [C:03+1] "I didn't review the tests, given they pass CI already. The logic LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154901 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:08:07] (03CR) 10Dzahn: "I assumed there is no other step needed besides submitting it. Seems like there is more to it?" [puppet] - 10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284) (owner: 10Dzahn) [18:10:14] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184570 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:19:00] (03CR) 10Scott French: "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1154900 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:19:23] (03CR) 10Scott French: [C:03+2] trafficserver: rename mw-php-migration to mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1154900 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:20:00] (03CR) 10Scott French: [C:03+2] trafficserver: generalize mw-next-routing.lua and prep for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1154901 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:20:00] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.17 refs T396378 [18:20:03] T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378 [18:22:11] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145215 (10Jgreen) >>! In T400275#11144417, @Jhancock.wm wrote: > it didn't save the password change for whatever reason. you should be all set for real this time =) All good now... 
[18:23:24] (03PS5) 10Scott French: trafficserver: generalize mw-next-routing.lua and prep for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1154901 (https://phabricator.wikimedia.org/T391421) [18:24:48] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1012.eqiad.wmnet with reason: host reimage [18:26:36] (03CR) 10Scott French: [C:03+2] trafficserver: generalize mw-next-routing.lua and prep for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1154901 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:28:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1012.eqiad.wmnet with reason: host reimage [18:28:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:29:13] (03CR) 10Ssingh: [C:03+1] "Yeah submitting it and making sure the CDN is happy. I can take care of it tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284) (owner: 10Dzahn) [18:33:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [18:34:49] (03CR) 10Dzahn: "ack, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284) (owner: 10Dzahn) [18:37:55] (03CR) 10Catrope: OATHAuth: Add Config Variable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [18:38:47] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145300 (10Ejegg) Hi @ssingh , I'd be hesitant to start using a different cookie for CentralNotice - we do sometimes ne... 
[18:42:08] (03CR) 10Cwhite: [C:03+2] airflow: disable icinga nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) (owner: 10Cwhite) [18:47:13] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:47:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:47:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps1012.eqiad.wmnet with OS bookworm [18:47:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host maps1012.eqiad.wmnet with OS bookworm completed: - maps1012 (**PASS**) - Rem... 
[18:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:51:45] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host maps1013.eqiad.wmnet with OS bookworm [18:51:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host maps1013.eqiad.wmnet with OS bookworm [18:57:33] (03PS6) 10Mstyles: OATHAuth: Enable 2FA opt-in for 10% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) [18:57:44] (03CR) 10Mstyles: OATHAuth: Enable 2FA opt-in for 10% of users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [18:58:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:00:21] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145362 (10ssingh) >>! In T122097#11145300, @Ejegg wrote: > Hi @ssingh , I'd be hesitant to start using a different coo... 
[19:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:03:59] (03PS1) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [19:04:45] (03PS1) 10Dzahn: peopleweb: add additional rsync destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) [19:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:17:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:18:11] (03CR) 10Catrope: [C:03+1] OATHAuth: Enable 2FA opt-in for 10% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [19:19:58] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1013.eqiad.wmnet with reason: host reimage [19:23:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1013.eqiad.wmnet with reason: host reimage [19:34:49] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host maps1014.eqiad.wmnet with OS bookworm [19:35:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - 
https://phabricator.wikimedia.org/T400638#11145493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host maps1014.eqiad.wmnet with OS bookworm [19:36:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:36:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T402925)', diff saved to https://phabricator.wikimedia.org/P82508 and previous config saved to /var/cache/conftool/dbconfig/20250903-193624-ladsgroup.json [19:36:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [19:43:20] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:46:25] vriley@cumin1003 reimage (PID 884436) is awaiting input [19:46:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:46:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps1013.eqiad.wmnet with OS bookworm [19:47:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host maps1013.eqiad.wmnet with OS bookworm completed: - maps1013 (**PASS**) - Rem... 
[19:47:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145535 (10VRiley-WMF) [19:48:53] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11145538 (10Ejegg) So as long as we switch CentralNotice to use the new cookie at the same time analytics switches the W... [19:51:46] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [19:59:58] (03PS1) 10Catrope: Fix display of Codex message icons [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184583 (https://phabricator.wikimedia.org/T401457) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T2000). [20:00:04] danisztls, chlod, and Kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:08] (03PS1) 10Catrope: Fix display of Codex message icons [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184584 (https://phabricator.wikimedia.org/T401457) [20:00:13] o/ [20:00:15] o/ here [20:00:30] o/ [20:00:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184584 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [20:00:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184584 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [20:00:59] I can self-deploy [20:01:09] Likewise. [20:01:44] i can't self-deploy :( [20:02:01] 🥺 [20:02:18] 👉👈 [20:02:22] chlod: I think I can deploy your patch together with mine [20:02:28] cool, thank you! [20:02:40] chlod: is it simple as it appears? 
[20:02:44] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1014.eqiad.wmnet with reason: host reimage [20:02:51] it is as simple as it appears [20:02:56] I can do my own too [20:03:09] I'll wait for danisztls to go first [20:03:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184583 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [20:03:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184160 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:03:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [20:04:30] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 (10RLazarus) 03NEW [20:04:43] (03Merged) 10jenkins-bot: Fix typo on newcomers survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184160 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:04:45] (03Merged) 10jenkins-bot: tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [20:05:13] !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1184160|Fix typo on newcomers survey (T402915)]], [[gerrit:1183750|tlwiktionary: add logos (T403433)]] [20:05:19] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:05:19] T403433: Change Tagalog Wiktionary site logo to localized version - https://phabricator.wikimedia.org/T403433 [20:05:23] ah, it seems like there's 
one script that has to be run to purge the old logo from the Varnish cache. would that be okay? https://w.wiki/FEf4 [20:06:12] "XXwiki" in that command would be "tlwiktionary" in this case; and that's the only file that needs purging [20:06:27] chlod: I'm using spiderpig so I can't do that [20:06:32] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [20:06:42] ah, alright then [20:07:25] I'll run that for you now [20:07:38] RoanKattouw: should I interrupt spiderpig? [20:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:07:45] No need [20:07:52] RoanKattouw: ok [20:08:12] Oh and I'll have to wait for spiderpig to finish, sorry, so I'll run it once it's done [20:08:24] thank you both, and apologies for the worry [20:09:12] RoanKattouw: should I wait for chlod to test the changes? [20:09:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1014.eqiad.wmnet with reason: host reimage [20:10:21] They should try to test the changes but they might not be able to due to caching. The changes might not be visible until they're fully deployed and I have run the purge script [20:10:31] So it's probably best to just continue without testing for this one [20:10:37] thanks [20:11:46] !log dani@deploy1003 chlod, dani: Backport for [[gerrit:1184160|Fix typo on newcomers survey (T402915)]], [[gerrit:1183750|tlwiktionary: add logos (T403433)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:11:51] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:11:51] T403433: Change Tagalog Wiktionary site logo to localized version - https://phabricator.wikimedia.org/T403433 [20:12:24] logo looks good :) [20:12:25] chlod: your change is on the testing servers if you want to check it [20:12:28] great! [20:12:34] !log dani@deploy1003 chlod, dani: Continuing with sync [20:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:14:41] chlod: Even though I wasn't able to properly deploy your change it saved everyone a few minutes [20:15:06] indeed so. many thanks! [20:17:09] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:17:56] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184160|Fix typo on newcomers survey (T402915)]], [[gerrit:1183750|tlwiktionary: add logos (T403433)]] (duration: 12m 43s) [20:18:01] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:18:01] T403433: Change Tagalog Wiktionary site logo to localized version - https://phabricator.wikimedia.org/T403433 [20:18:16] RoanKattouw: all yours to proceed with chlod change [20:19:26] If Roan's not here right now, I could get mine. 
[20:19:56] Running the purgeList script now [20:20:09] Done, Kemayo go ahead [20:20:15] And then I'll go after you [20:21:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [20:21:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [20:21:13] And chlod please check that the logos are updated now [20:21:59] all logos looking good :) [20:22:00] (03Merged) 10jenkins-bot: Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [20:22:03] thank you both again for the deploy! [20:25:46] (03PS1) 10Ebernhardson: dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) [20:27:30] 06SRE, 06Traffic: Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11145677 (10Milimetric) I'm being bold and removing this from the Icebox so we can talk about it. I've heard a few different folks talk about related needs, but I'll just de... 
[20:27:31] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:29:03] !log Created cusi_case on testwiki extension1 [20:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:41] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145682 (10Jgreen) [20:30:15] !log Created cusi_user on testwiki extension1 - T403473 [20:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] T403473: Create suggested investigation database tables on WMF production - https://phabricator.wikimedia.org/T403473 [20:30:34] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11145687 (10RLazarus) [20:31:25] !log Created cusi_signal on testwiki extension1 - T403473 [20:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:50] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:31:51] Kemayo: I'm gonna grab some lunch, could you ping me when your deploy is done? I imagine you're still waiting on CI [20:31:59] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11145695 (10RLazarus) Removed the tracing item > (1.29.10) tracing: Fixed a bug where the OpenTelemetry tracer exports the OTLP request even when no spans are present. after consulting with our tracing exper... 
[20:32:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:32:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps1014.eqiad.wmnet with OS bookworm [20:32:21] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145696 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host maps1014.eqiad.wmnet with OS bookworm completed: - maps1014 (**PASS**) - Rem... [20:32:24] RoanKattouw: It should be just about to finish CI, according to the test status, but it'll be another few minute for testing and deploy after that. [20:32:43] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145697 (VRiley-WMF) [20:33:00] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11145701 (VRiley-WMF) Open→Resolved this is resolved [20:33:52] (Merged) jenkins-bot: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) (owner: DLynch) [20:34:19] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [20:34:20] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1184124|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]], [[gerrit:1184120|Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki (T389231)]] [20:34:25] T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952 [20:34:25] 
T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231 [20:36:14] !log Created cusi_user on frwiki, zhwiki, idwiki, jawiki, fawiki, ptwiki, trwiki, and enwiki [20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:31] !log Created cusi_user on frwiki, zhwiki, idwiki, jawiki, fawiki, ptwiki, trwiki, and enwiki in the extension1 cluster - T403473 [20:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:34] T403473: Create suggested investigation database tables on WMF production - https://phabricator.wikimedia.org/T403473 [20:37:36] !log Created cusi_case on frwiki, zhwiki, idwiki, jawiki, fawiki, ptwiki, trwiki, and enwiki in the extension1 cluster - T403473 [20:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:22] !log Created cusi_signal on frwiki, zhwiki, idwiki, jawiki, fawiki, ptwiki, trwiki, and enwiki in the extension1 cluster - T403473 [20:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:58] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1184124|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]], [[gerrit:1184120|Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki (T389231)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:39:14] !log Created cusi_case on testwiki extension1 - T403473 [20:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] !log kemayo@deploy1003 kemayo: Continuing with sync [20:40:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:43:38] (03PS1) 10MusikAnimal: labs: log CommunityRequests channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184589 [20:45:01] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145758 (10Jgreen) @Jhancock.wm sorry, we didn't manage to note which vlan these go in earlier. Could you switch the ports for both hosts into frack-fundraising-codfw? I think I... [20:45:32] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184124|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]], [[gerrit:1184120|Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki (T389231)]] (duration: 11m 12s) [20:45:37] T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952 [20:45:37] T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231 [20:45:45] RoanKattouw: you're up. 
[20:46:46] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [skins/Vector] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184584 (https://phabricator.wikimedia.org/T401457) (owner: Catrope) [20:46:47] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [skins/Vector] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184583 (https://phabricator.wikimedia.org/T401457) (owner: Catrope) [20:48:08] (Merged) jenkins-bot: Fix display of Codex message icons [skins/Vector] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1184584 (https://phabricator.wikimedia.org/T401457) (owner: Catrope) [20:48:43] SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145771 (Papaul) @Jgreen it is already in that vlan https://netbox.wikimedia.org/dcim/interfaces/44404/edit/ [20:49:13] (PS3) Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) [20:49:30] SRE, MediaWiki-Action-API, MW-Interfaces-Team, Wikimedia-production-error: Frequent HTTP 503 errors from MediaWiki API every 1 or 2 minutes - https://phabricator.wikimedia.org/T390438#11145773 (matmarex) I noticed that all of these errors have `IP address: 172.16.xx.xx`, which is a private IP ran... [20:50:58] SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145775 (Jgreen) >>! In T400275#11145771, @Papaul wrote: > @Jgreen it is already in that vlan > https://netbox.wikimedia.org/dcim/interfaces/44404/edit/ Yeahhhh, typing is har... 
[20:52:28] SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145776 (Papaul) @Jgreen ok let me take care of it [20:56:16] SRE, MediaWiki-Action-API, MW-Interfaces-Team, Wikimedia-production-error: Frequent HTTP 503 errors from MediaWiki API every 1 or 2 minutes - https://phabricator.wikimedia.org/T390438#11145790 (Dzahn) >>! In T390438#11145773, @matmarex wrote: > I noticed that all of these errors have `IP address:... [20:58:53] (Merged) jenkins-bot: Fix display of Codex message icons [skins/Vector] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1184583 (https://phabricator.wikimedia.org/T401457) (owner: Catrope) [20:59:23] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1184584|Fix display of Codex message icons (T401457)]], [[gerrit:1184583|Fix display of Codex message icons (T401457)]] [20:59:27] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T2100) [21:02:49] Not deploying now, in case anyone else wants the window. [21:03:49] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:06:01] !log catrope@deploy1003 catrope: Backport for [[gerrit:1184584|Fix display of Codex message icons (T401457)]], [[gerrit:1184583|Fix display of Codex message icons (T401457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:04] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:07:26] !log catrope@deploy1003 catrope: Continuing with sync [21:09:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:12:44] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184584|Fix display of Codex message icons (T401457)]], [[gerrit:1184583|Fix display of Codex message icons (T401457)]] (duration: 13m 20s) [21:12:47] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:13:19] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modifiy DNS for frm2002 and frdb2002 - pt1979@cumin2002" [21:13:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modifiy DNS for frm2002 and frdb2002 - pt1979@cumin2002" [21:13:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:15] (CR) Cwhite: [C:+2] hiera: disable monitoring for legacy profile::airflow::instances [puppet] - https://gerrit.wikimedia.org/r/1184170 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite) [21:20:23] (PS2) Cwhite: hiera: disable monitoring for legacy profile::airflow::instances [puppet] - https://gerrit.wikimedia.org/r/1184170 (https://phabricator.wikimedia.org/T384214) [21:21:32] (CR) Cwhite: [C:+2] hiera: disable monitoring for legacy profile::airflow::instances [puppet] - https://gerrit.wikimedia.org/r/1184170 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite) [21:25:12] jhathaway@cumin1002 reimage (PID 250482) is awaiting input [21:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:30:50] (PS7) Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) [21:31:04] (CR) Ryan Kemper: [C:+1] dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: Bking) [21:32:03] (PS8) Bking: dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) [21:33:56] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:34:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:37:52] !log Running `mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --quick --zType Z4 --verbose` to try to fix T403671 [21:37:53] (CR) Btullis: [C:+1] dse-k8s-worker: Add sysctl setting that's required for OpenSearch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T402926) (owner: Bking) [21:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:55] T403671: Partial Wikifunctions service outage: Z504s (not 
found) being thrown, plus Type serialisation is failing - https://phabricator.wikimedia.org/T403671 [21:43:26] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [21:48:56] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:50:37] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [21:54:03] SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145932 (Jgreen) Hosts are up and talking to the puppetservers, all good thank you! [21:56:19] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [21:58:42] (CR) Xcollazo: [C:+1] dumps: Sync cirrus index dumps from hdfs [puppet] - https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: Ebernhardson) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T2200) [22:08:52] RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1216) taken on 2025-09-03 21:39:17 (308 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [22:09:40] * Jdlrobson loads spiderpig [22:10:10] (CR) Jdlrobson: "Want me to land this @kartik.mistry@gmail.com or would you prefer to?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1182944 (owner: Jdlrobson) [22:10:35] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184559 (https://phabricator.wikimedia.org/T400066) (owner: Jdlrobson) [22:11:03] SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11145970 (Papaul) Open→Resolved [22:11:25] (Merged) jenkins-bot: Cleanup special wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1184559 (https://phabricator.wikimedia.org/T400066) (owner: Jdlrobson) [22:11:50] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1184559|Cleanup special wikis (T400066)]] [22:11:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T402925)', diff saved to https://phabricator.wikimedia.org/P82509 and previous config saved to /var/cache/conftool/dbconfig/20250903-221151-ladsgroup.json [22:11:54] T400066: Clean up Web-maintained settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400066 [22:11:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:16:42] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1184559|Cleanup special wikis (T400066)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:18:10] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:23:37] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184559|Cleanup special wikis (T400066)]] (duration: 11m 47s) [22:23:41] T400066: Clean up Web-maintained settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400066 [22:26:23] all done. 
[22:27:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P82510 and previous config saved to /var/cache/conftool/dbconfig/20250903-222659-ladsgroup.json [22:27:14] SRE, Infrastructure-Foundations: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452#11146012 (Dzahn) > I do not know if .. offboard-user.py could grep the Puppet repository for the email address of the user to be offboarde... [22:33:56] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:39:30] SRE, Infrastructure-Foundations: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452#11146065 (Dzahn) another approach to all this could be: - identify (puppet/ all) code that has "email address-like patterns" (https://e-m... [22:42:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P82511 and previous config saved to /var/cache/conftool/dbconfig/20250903-224206-ladsgroup.json [22:42:40] (CR) Dzahn: [V:-1] "parameter 'rsync_dst_host' expects a Stdlib::Host" [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [22:44:25] (CR) Dzahn: [V:-1] "rsync::quickdatacopy supports multiple dest hosts.. 
just the more specific class using it does not" [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [22:45:35] (PS2) Cwhite: airflow: remove nrpe definitions [puppet] - https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214) [22:46:27] (PS3) Cwhite: airflow: remove nrpe definitions [puppet] - https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214) [22:46:38] (PS2) Dzahn: peopleweb: allow multiple rsync destination hosts [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) [22:48:56] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:57] (CR) Cwhite: [C:+2] airflow: remove nrpe definitions [puppet] - https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite) [22:51:34] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1184573/6841/" [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [22:52:17] (PS3) Dzahn: peopleweb: allow multiple rsync destination hosts [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) [22:53:06] (CR) Dzahn: [C:+2] peopleweb: allow multiple rsync destination hosts [puppet] - https://gerrit.wikimedia.org/r/1184573 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [22:57:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T402925)', diff saved to https://phabricator.wikimedia.org/P82512 and previous config saved to /var/cache/conftool/dbconfig/20250903-225714-ladsgroup.json [22:57:18] T402925: Drop cl_to and cl_collation from 
categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:57:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [22:57:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T402925)', diff saved to https://phabricator.wikimedia.org/P82513 and previous config saved to /var/cache/conftool/dbconfig/20250903-225738-ladsgroup.json [22:59:08] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:03:56] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:08:56] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:11:03] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:11:05] (CR) Cwhite: [C:+1] nrpe2nodexp: add alertmanager_team param to override role_owner metric [puppet] - https://gerrit.wikimedia.org/r/1184487 (owner: Tiziano Fogli) [23:15:43] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:15:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:17:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:20:43] RESOLVED: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:23:22] RECOVERY - people.wikimedia.org requires authentication on people2004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:24:22] RECOVERY - people.wikimedia.org requires authentication on people1005 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184608 [23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184608 (owner: TrainBranchBot) [23:38:37] !log Adding 
slack_bot_token to private repo - T401730 [23:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:41] T401730: Add a pathway for Alertmanager to send alerts in Slack - https://phabricator.wikimedia.org/T401730 [23:52:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184608 (owner: TrainBranchBot) [23:52:29] (CR) Andrea Denisse: [C:+1] "nit: I noticed a small typo in the commit message, it says "But: T395446" instead of "Bug: T395446"." [puppet] - https://gerrit.wikimedia.org/r/1184487 (owner: Tiziano Fogli) [23:55:41] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183700 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle) [23:56:41] (Merged) jenkins-bot: Disable wmgUseMdotRouting on testwiki in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1183700 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle) [23:57:28] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1183700|Disable wmgUseMdotRouting on testwiki in prod (T401595)]] [23:57:31] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595