[00:08:02] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170767
[00:08:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170767 (owner: TrainBranchBot)
[00:31:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170767 (owner: TrainBranchBot)
[00:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[00:56:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:14:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:19:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:54:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:09:09] FIRING: [3x] CoreRouterInterfaceDown: Core
router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:39:05] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:49:22] (CR) Kevin Bazira: [C:+2] ml-services: enable multiprocessing for kowiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1170447 (https://phabricator.wikimedia.org/T363336) (owner: Kevin Bazira)
[04:51:21] (Merged) jenkins-bot: ml-services: enable multiprocessing for kowiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1170447 (https://phabricator.wikimedia.org/T363336) (owner: Kevin Bazira)
[04:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines.
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[04:56:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:56:34] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:58:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:58:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:59:29] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[05:00:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:00:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:34:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:39:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:44:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion,
IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:52:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance
[06:05:59] (PS1) Marostegui: db2150: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170789 (https://phabricator.wikimedia.org/T399955)
[06:07:42] (CR) Marostegui: [C:+2] db2150: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170789 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[06:09:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2150.codfw.wmnet with reason: Maintenance
[06:09:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2150 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79446 and previous config saved to /var/cache/conftool/dbconfig/20250721-060923-marostegui.json
[06:13:18] (PS1) Marostegui: db1170: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170790 (https://phabricator.wikimedia.org/T399955)
[06:14:38] (CR) Marostegui: [C:+2] db1170: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170790 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[06:16:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[06:16:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1170 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79447 and previous config saved to /var/cache/conftool/dbconfig/20250721-061606-marostegui.json
[06:17:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79448 and previous
config saved to /var/cache/conftool/dbconfig/20250721-061726-root.json
[06:23:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79449 and previous config saved to /var/cache/conftool/dbconfig/20250721-062353-root.json
[06:31:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[06:32:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79450 and previous config saved to /var/cache/conftool/dbconfig/20250721-063232-root.json
[06:38:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79451 and previous config saved to /var/cache/conftool/dbconfig/20250721-063859-root.json
[06:39:05] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:41:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[06:44:18] (PS1) Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1170899 (https://phabricator.wikimedia.org/T400027)
[06:44:23] (PS1) Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1170902 (https://phabricator.wikimedia.org/T400027)
[06:44:54] (PS1) Gerrit maintenance bot: mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1170907 (https://phabricator.wikimedia.org/T400028)
[06:44:59] (PS1) Gerrit maintenance bot: wmnet: Update es7-master alias [dns] -
https://gerrit.wikimedia.org/r/1170910 (https://phabricator.wikimedia.org/T400028)
[06:47:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1039.eqiad.wmnet with reason: Maintenance
[06:47:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79452 and previous config saved to /var/cache/conftool/dbconfig/20250721-064738-root.json
[06:47:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039', diff saved to https://phabricator.wikimedia.org/P79453 and previous config saved to /var/cache/conftool/dbconfig/20250721-064755-root.json
[06:49:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:50:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es7 eqiad as read-only for maintenance', diff saved to https://phabricator.wikimedia.org/P79454 and previous config saved to /var/cache/conftool/dbconfig/20250721-065049-marostegui.json
[06:50:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79455 and previous config saved to /var/cache/conftool/dbconfig/20250721-065057-root.json
[06:51:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es7 T400028
[06:51:17] T400028: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T400028
[06:54:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79456 and previous config saved to /var/cache/conftool/dbconfig/20250721-065405-root.json
[06:54:09] FIRING:
[3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:55:58] (CR) Arnaudb: [C:+1] "thanks for the answers, lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[06:57:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es7 eqiad back to read-write', diff saved to https://phabricator.wikimedia.org/P79457 and previous config saved to /var/cache/conftool/dbconfig/20250721-065710-marostegui.json
[06:58:47] (Abandoned) Marostegui: wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1170910 (https://phabricator.wikimedia.org/T400028) (owner: Gerrit maintenance bot)
[06:58:52] (Abandoned) Marostegui: mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1170907 (https://phabricator.wikimedia.org/T400028) (owner: Gerrit maintenance bot)
[07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79458 and previous config saved to /var/cache/conftool/dbconfig/20250721-070243-root.json
[07:04:28] (Abandoned) Marostegui: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1170902 (https://phabricator.wikimedia.org/T400027) (owner: Gerrit maintenance bot)
[07:04:33] (Abandoned) Marostegui: mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1170899 (https://phabricator.wikimedia.org/T400027) (owner: Gerrit maintenance bot)
[07:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79459 and previous config saved to /var/cache/conftool/dbconfig/20250721-070602-root.json
[07:06:39] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[07:06:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:07:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance
[07:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79461 and previous config saved to /var/cache/conftool/dbconfig/20250721-070910-root.json
[07:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:09:38] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[07:09:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:10:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00
on db[2160,2235].codfw.wmnet,db[1164,1217].eqiad.wmnet with reason: Maintenance
[07:14:14] (PS1) WMDE-Fisch: Revert "VE: Enforce referenceslist reserialization when MW changed" [extensions/Cite] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171031 (https://phabricator.wikimedia.org/T400013)
[07:14:44] (PS2) WMDE-Fisch: Revert "VE: Enforce referenceslist reserialization when MW changed" [extensions/Cite] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171031 (https://phabricator.wikimedia.org/T400013)
[07:14:49] I'm planning to deploy ^ this bugfix now.
[07:14:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Maintenance in es6
[07:16:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es6 eqiad as read-only for maintenance - T400027', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250721-071604-marostegui.json
[07:16:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:16:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Cite] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171031 (https://phabricator.wikimedia.org/T400013) (owner: WMDE-Fisch)
[07:16:42] T400027: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T400027
[07:16:50] SRE, Citoid, CXServer, RESTBase, and 2 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#11019298 (Nikerabbit)
[07:17:45] !log
marostegui@cumin1002 dbctl commit (dc=all): 'Set es6 eqiad back to read-write - T400027', diff saved to https://phabricator.wikimedia.org/P79463 and previous config saved to /var/cache/conftool/dbconfig/20250721-071744-marostegui.json
[07:17:59] (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171031 (https://phabricator.wikimedia.org/T400013) (owner: WMDE-Fisch)
[07:19:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove weight from es7 master', diff saved to https://phabricator.wikimedia.org/P79464 and previous config saved to /var/cache/conftool/dbconfig/20250721-071949-marostegui.json
[07:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1037', diff saved to https://phabricator.wikimedia.org/P79465 and previous config saved to /var/cache/conftool/dbconfig/20250721-072037-root.json
[07:21:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79466 and previous config saved to /var/cache/conftool/dbconfig/20250721-072108-root.json
[07:21:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79467 and previous config saved to /var/cache/conftool/dbconfig/20250721-072313-root.json
[07:24:33] SRE, DBA, Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11019338 (FCeratto-WMF)
[07:24:51] SRE, DBA, Observability-Alerting: single DB server
replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11019343 (FCeratto-WMF) Related to T384810
[07:27:04] When is the train cutoff for MediaWiki extensions normally? Friday or Monday?
[07:28:41] Jhs: The branch cut is early morning UTC on Tuesday, for example https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0200
[07:30:12] (Merged) jenkins-bot: Revert "VE: Enforce referenceslist reserialization when MW changed" [extensions/Cite] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171031 (https://phabricator.wikimedia.org/T400013) (owner: WMDE-Fisch)
[07:30:34] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1171031|Revert "VE: Enforce referenceslist reserialization when MW changed" (T400013 T396017)]]
[07:30:40] T400013: VisualEditor deletes list-defined references if there's a reference containing an ISBN and magic linking is enabled - https://phabricator.wikimedia.org/T400013
[07:30:40] T396017: VE should save updated main+details main bodies as references list items when converting to Parsoid DOM - https://phabricator.wikimedia.org/T396017
[07:36:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79468 and previous config saved to /var/cache/conftool/dbconfig/20250721-073614-root.json
[07:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79469 and previous config saved to /var/cache/conftool/dbconfig/20250721-073819-root.json
[07:41:04] (PS1) Marostegui: db2142: Remove testing comment [puppet] - https://gerrit.wikimedia.org/r/1171143
[07:41:57] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1171143 (owner: Marostegui)
[07:41:58] (CR) Marostegui: [C:+2] db2142: Remove testing comment [puppet] - https://gerrit.wikimedia.org/r/1171143 (owner:
Marostegui)
[07:47:43] (PS1) Marostegui: db1174: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1171144 (https://phabricator.wikimedia.org/T399955)
[07:49:26] (CR) Marostegui: [C:+2] db1174: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1171144 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[07:50:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[07:50:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79470 and previous config saved to /var/cache/conftool/dbconfig/20250721-075025-marostegui.json
[07:51:08] (PS1) Marostegui: db2159: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1171145 (https://phabricator.wikimedia.org/T399955)
[07:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79471 and previous config saved to /var/cache/conftool/dbconfig/20250721-075120-root.json
[07:51:25] !log awight@deploy1003 wmde-fisch, awight: Backport for [[gerrit:1171031|Revert "VE: Enforce referenceslist reserialization when MW changed" (T400013 T396017)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:51:30] T400013: VisualEditor deletes list-defined references if there's a reference containing an ISBN and magic linking is enabled - https://phabricator.wikimedia.org/T400013
[07:51:31] T396017: VE should save updated main+details main bodies as references list items when converting to Parsoid DOM - https://phabricator.wikimedia.org/T396017
[07:51:37] (CR) Marostegui: [C:+2] db2159: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1171145 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[07:51:53] SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11019380 (KartikMistry) @Dzahn Done. Thanks!
[07:52:22] awight, aight (haha, punny!), thanks!
[07:52:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2159.codfw.wmnet with reason: Maintenance
[07:52:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2159 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79472 and previous config saved to /var/cache/conftool/dbconfig/20250721-075256-marostegui.json
[07:53:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79473 and previous config saved to /var/cache/conftool/dbconfig/20250721-075325-root.json
[07:54:16] (PS12) Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[07:54:16] (PS7) Elukey: DNM - test for ML hosts [cookbooks] - https://gerrit.wikimedia.org/r/1170463
[07:54:58] !log awight@deploy1003 wmde-fisch, awight: Continuing with sync
[07:55:04] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with
chassis set policy FORCE_RESTART
[07:58:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79474 and previous config saved to /var/cache/conftool/dbconfig/20250721-075817-root.json
[07:59:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:00:22] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:00:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79475 and previous config saved to /var/cache/conftool/dbconfig/20250721-080047-root.json
[08:01:25] (CR) CI reject: [V:-1] DNM - test for ML hosts [cookbooks] - https://gerrit.wikimedia.org/r/1170463 (owner: Elukey)
[08:01:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Maintenance in s3
[08:02:32] (CR) CI reject: [V:-1] WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[08:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T399930', diff saved to https://phabricator.wikimedia.org/P79476 and previous config saved to /var/cache/conftool/dbconfig/20250721-080528-root.json
[08:05:33] T399930: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T399930
[08:05:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T399930
[08:06:07] (CR) Marostegui: [C:+2] mariadb: Promote db2205 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1170499 (https://phabricator.wikimedia.org/T399930) (owner:
Gerrit maintenance bot)
[08:07:47] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171031|Revert "VE: Enforce referenceslist reserialization when MW changed" (T400013 T396017)]] (duration: 37m 13s)
[08:07:52] T400013: VisualEditor deletes list-defined references if there's a reference containing an ISBN and magic linking is enabled - https://phabricator.wikimedia.org/T400013
[08:07:52] T396017: VE should save updated main+details main bodies as references list items when converting to Parsoid DOM - https://phabricator.wikimedia.org/T396017
[08:08:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79477 and previous config saved to /var/cache/conftool/dbconfig/20250721-080831-root.json
[08:08:36] !log Starting s3 codfw failover from db2209 to db2205 - T399930
[08:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary T399930', diff saved to https://phabricator.wikimedia.org/P79478 and previous config saved to /var/cache/conftool/dbconfig/20250721-080907-marostegui.json
[08:09:26] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm
[08:09:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2209 T399930', diff saved to https://phabricator.wikimedia.org/P79479 and previous config saved to /var/cache/conftool/dbconfig/20250721-080951-marostegui.json
[08:11:00] (PS1) Marostegui: db2209: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1171146 (https://phabricator.wikimedia.org/T399548)
[08:11:19] SRE, SRE-Access-Requests, Gerrit-Privilege-Requests, LDAP-Access-Requests: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11019438 (joanna_borun)
[08:12:08] (CR) Marostegui: [C:+2] db2209: Migrate to MariaDB 10.11
[puppet] - https://gerrit.wikimedia.org/r/1171146 (https://phabricator.wikimedia.org/T399548) (owner: Marostegui)
[08:12:24] (PS1) Bartosz Wójtowicz: ml-services: Update image version for revertrisk models on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1171147 (https://phabricator.wikimedia.org/T383119)
[08:13:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79480 and previous config saved to /var/cache/conftool/dbconfig/20250721-081323-root.json
[08:15:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79481 and previous config saved to /var/cache/conftool/dbconfig/20250721-081553-root.json
[08:17:34] ops-eqiad, DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - https://phabricator.wikimedia.org/T400034 (ayounsi) NEW p:Triage→Low
[08:18:04] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11019463 (elukey) Thanks! I was able to provision the host and end up in Debian install, progress :) Of course now it fails since we didn't really configure any vali...
[08:20:23] ops-eqiad, DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - https://phabricator.wikimedia.org/T400034#11019464 (ayounsi)
[08:20:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:21:05] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm
[08:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79482 and previous config saved to /var/cache/conftool/dbconfig/20250721-082313-root.json
[08:23:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79483 and previous config saved to /var/cache/conftool/dbconfig/20250721-082337-root.json
[08:26:05] !log slow morning deployment finished
[08:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79484 and previous config saved to /var/cache/conftool/dbconfig/20250721-082829-root.json
[08:28:48] (PS1) Elukey: install_server: update partman config for ml-serve101[23] [puppet] - https://gerrit.wikimedia.org/r/1171148 (https://phabricator.wikimedia.org/T393948)
[08:30:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79485 and previous config saved to /var/cache/conftool/dbconfig/20250721-083058-root.json
[08:31:58] (PS2) Elukey: install_server: update partman config for ml-serve101[23] [puppet] - https://gerrit.wikimedia.org/r/1171148
(https://phabricator.wikimedia.org/T393948) [08:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79486 and previous config saved to /var/cache/conftool/dbconfig/20250721-083819-root.json [08:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79487 and previous config saved to /var/cache/conftool/dbconfig/20250721-084334-root.json [08:44:35] (03CR) 10WMDE-Fisch: [C:03+1] InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170736 (https://phabricator.wikimedia.org/T362771) (owner: 10Reedy) [08:46:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79488 and previous config saved to /var/cache/conftool/dbconfig/20250721-084604-root.json [08:51:24] (03PS1) 10Marostegui: db2168: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171153 (https://phabricator.wikimedia.org/T399955) [08:52:07] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [08:52:36] (03CR) 10Marostegui: [C:03+2] db2168: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171153 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [08:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:53:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79489 and previous config saved to /var/cache/conftool/dbconfig/20250721-085325-root.json [08:53:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2168.codfw.wmnet with reason: Maintenance [08:53:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2168 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79490 and previous config saved to /var/cache/conftool/dbconfig/20250721-085333-marostegui.json [08:56:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:56:02] (03PS1) 10Marostegui: db1191: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171154 (https://phabricator.wikimedia.org/T399955) [08:56:34] (03CR) 10Marostegui: [C:03+2] db1191: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171154 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [08:57:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1191.eqiad.wmnet with reason: Maintenance [08:57:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1191 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79491 and previous config saved to /var/cache/conftool/dbconfig/20250721-085719-marostegui.json [08:58:10] (03PS1) 10Giuseppe Lavagetto: Deploy schema-upgrades and dropping of bw_limit_duration 
[software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1171155 (https://phabricator.wikimedia.org/T399534) [08:59:19] !log restarting haproxykafka service on cp5017 [08:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:13] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Deploy schema-upgrades and dropping of bw_limit_duration [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1171155 (https://phabricator.wikimedia.org/T399534) (owner: 10Giuseppe Lavagetto) [09:01:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79492 and previous config saved to /var/cache/conftool/dbconfig/20250721-090119-root.json [09:02:39] (03PS1) 10Marostegui: installserver: Do not reimage es2047 [puppet] - 10https://gerrit.wikimedia.org/r/1171156 [09:04:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79493 and previous config saved to /var/cache/conftool/dbconfig/20250721-090452-root.json [09:05:11] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [09:05:14] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es2047 [puppet] - 10https://gerrit.wikimedia.org/r/1171156 (owner: 10Marostegui) [09:06:16] (03CR) 10Lucas Werkmeister (WMDE): Enable wbui2025 mobile user interface on Wikidata Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: 10Arthur taylor) [09:08:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79494 and previous config saved to /var/cache/conftool/dbconfig/20250721-090830-root.json [09:08:50] (03PS1) 10Marostegui: db1246: Remove yaml [puppet] - 10https://gerrit.wikimedia.org/r/1171157 (https://phabricator.wikimedia.org/T399449) [09:08:56] 
!log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1246.eqiad.wmnet [09:09:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:09:21] (03CR) 10Marostegui: [C:03+2] db1246: Remove yaml [puppet] - 10https://gerrit.wikimedia.org/r/1171157 (https://phabricator.wikimedia.org/T399449) (owner: 10Marostegui) [09:10:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:14:42] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:15:11] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Drop bw_limit_duration from haproxy_action - oblivian@cumin1003" [09:15:13] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Drop bw_limit_duration from haproxy_action - oblivian@cumin1003 [09:15:47] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Drop bw_limit_duration from haproxy_action - oblivian@cumin1003 [09:15:48] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts 
with reason: "Drop bw_limit_duration from haproxy_action - oblivian@cumin1003" [09:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79495 and previous config saved to /var/cache/conftool/dbconfig/20250721-091625-root.json [09:18:14] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1246.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:18:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1246.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:18:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:18:43] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1246.eqiad.wmnet [09:19:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:19:29] (03CR) 10Volans: [C:03+1] "Sure, go ahead ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [09:19:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79496 and previous config saved to /var/cache/conftool/dbconfig/20250721-091958-root.json [09:19:59] (03PS1) 10Marostegui: site.pp: Fix backup sources comments [puppet] - 10https://gerrit.wikimedia.org/r/1171159 (https://phabricator.wikimedia.org/T350458) [09:20:18] !log manually install 
bird2_2.17.1+branch.mq.bgp.multilisten.c47b08 on ganeti2033 and ganeti700x - T362392 [09:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:22] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [09:20:56] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449#11019595 (10Marostegui) a:05Marostegui→03VRiley-WMF @VRiley-WMF this host has been "decommissioned" from our side, not sure what would be pending from... [09:21:07] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449#11019601 (10Marostegui) [09:21:47] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1171159 (https://phabricator.wikimedia.org/T350458) (owner: 10Marostegui) [09:21:49] (03CR) 10Marostegui: [C:03+2] site.pp: Fix backup sources comments [puppet] - 10https://gerrit.wikimedia.org/r/1171159 (https://phabricator.wikimedia.org/T350458) (owner: 10Marostegui) [09:23:31] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449#11019605 (10Marostegui) >>! In T399449#11019588, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts:... 
[09:23:54] (03PS1) 10Volans: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 [09:24:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:29:30] (03PS2) 10Volans: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 [09:31:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79497 and previous config saved to /var/cache/conftool/dbconfig/20250721-093131-root.json [09:33:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:34:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:35:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79498 and previous config saved to /var/cache/conftool/dbconfig/20250721-093504-root.json [09:35:28] (03PS3) 10Volans: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 [09:37:51] (03CR) 10Ayounsi: [C:03+1] "should you add a comment saying when we should look at unpinning it?" 
[software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [09:38:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:39:11] (03CR) 10Volans: "In other projects we're keeping it pinned and upgrading it when we want/have time to prevent new releases to make CI fails unexpectedly. S" [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [09:40:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:40:42] (03PS1) 10Marostegui: db1194: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171163 (https://phabricator.wikimedia.org/T399955) [09:41:12] (03CR) 10Marostegui: [C:03+2] db1194: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171163 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [09:41:13] (03CR) 10Elukey: [C:03+2] install_server: update partman config for ml-serve101[23] [puppet] - 10https://gerrit.wikimedia.org/r/1171148 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [09:42:03] (03PS1) 10Giuseppe Lavagetto: requestctl_client: fail on missing api token defs [puppet] - 10https://gerrit.wikimedia.org/r/1171164 [09:42:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1194.eqiad.wmnet with reason: Maintenance [09:42:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1194 for migration to mariadb 10.11', diff saved to 
https://phabricator.wikimedia.org/P79499 and previous config saved to /var/cache/conftool/dbconfig/20250721-094221-marostegui.json [09:44:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:44:05] (03CR) 10Ayounsi: "LGTM, thx, to be deployed only once all the active "notify" are solved." [puppet] - 10https://gerrit.wikimedia.org/r/1171164 (owner: 10Giuseppe Lavagetto) [09:45:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:45:10] (03CR) 10Ayounsi: [C:03+1] "sounds good!" [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [09:45:17] (03CR) 10CI reject: [V:04-1] requestctl_client: fail on missing api token defs [puppet] - 10https://gerrit.wikimedia.org/r/1171164 (owner: 10Giuseppe Lavagetto) [09:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79500 and previous config saved to /var/cache/conftool/dbconfig/20250721-094636-root.json [09:48:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:48:28] (03PS2) 10Giuseppe Lavagetto: requestctl_client: fail on missing api token defs [puppet] - 10https://gerrit.wikimedia.org/r/1171164 [09:48:54] (03PS1) 10Marostegui: db2182: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171165 (https://phabricator.wikimedia.org/T399955) [09:49:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:49:21] (03PS1) 10Elukey: admin_ng: increase allowed CPUs for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171166 [09:49:37] (03CR) 10Volans: "Thanks for the patch! The approach looks good, quick question inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [09:50:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79501 and previous config saved to /var/cache/conftool/dbconfig/20250721-095001-root.json [09:50:03] (03CR) 10Marostegui: [C:03+2] db2182: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171165 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [09:50:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:50:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79502 and previous config saved to /var/cache/conftool/dbconfig/20250721-095009-root.json [09:51:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:51:10] (03CR) 10Ayounsi: [C:03+1] requestctl_client: fail on missing api token defs [puppet] - 10https://gerrit.wikimedia.org/r/1171164 (owner: 10Giuseppe Lavagetto) [09:51:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2182 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79503 and previous config saved to 
/var/cache/conftool/dbconfig/20250721-095112-marostegui.json [09:52:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:52:55] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_client: fail on missing api token defs [puppet] - 10https://gerrit.wikimedia.org/r/1171164 (owner: 10Giuseppe Lavagetto) [09:53:29] (03CR) 10CI reject: [V:04-1] setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [09:55:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:55:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:55:19] (03PS4) 10Volans: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 [09:57:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:57:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:57:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:59:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79504 and previous config saved to /var/cache/conftool/dbconfig/20250721-095918-root.json [09:59:39] (03PS1) 10Samtar: IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171168 (https://phabricator.wikimedia.org/T391064) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1000) [10:00:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:17] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [10:00:50] (03CR) 10Elukey: [C:03+2] admin_ng: increase allowed CPUs for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171166 (owner: 10Elukey) [10:01:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:01:05] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:02:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:02:11] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [10:03:21] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [10:03:41] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [10:04:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:04:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79505 and previous config saved to /var/cache/conftool/dbconfig/20250721-100418-marostegui.json [10:04:23] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:05:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79506 and previous config saved to /var/cache/conftool/dbconfig/20250721-100508-root.json [10:05:16] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [10:07:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:07:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:12:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:13:06] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [10:13:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:14:07] PROBLEM - OSPF status on 
cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:14:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79507 and previous config saved to /var/cache/conftool/dbconfig/20250721-101426-root.json [10:15:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:15:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:16:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:17:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:18:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:19:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:20:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79508 and 
previous config saved to /var/cache/conftool/dbconfig/20250721-102014-root.json [10:24:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:26:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:29:01] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171168 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [10:29:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79509 and previous config saved to /var/cache/conftool/dbconfig/20250721-102934-root.json [10:30:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:30:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:31:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:31:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:31:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:31:59] jouncebot: nowandnext [10:31:59] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1000) [10:31:59] In 2 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1300) [10:33:13] (03PS1) 10Tiziano Fogli: prom/metamonitor: add listen_port to public_endpoint vhost template [puppet] - 10https://gerrit.wikimedia.org/r/1171170 (https://phabricator.wikimedia.org/T397003) [10:33:27] (03PS1) 10Tiziano Fogli: prom/metamonitor: force gunicorn to log to a file [puppet] - 10https://gerrit.wikimedia.org/r/1171171 (https://phabricator.wikimedia.org/T397003) [10:33:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171168 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [10:34:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:27] (03Merged) 10jenkins-bot: IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171168 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [10:34:41] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1171168|IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates (T391064)]] [10:34:45] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [10:35:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79510 and previous config saved to /var/cache/conftool/dbconfig/20250721-103520-root.json [10:36:37] !log samtar@deploy1003 samtar: Backport for [[gerrit:1171168|IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates (T391064)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:37:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:37:47] * TheresNoTime testing ^ [10:39:05] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:42:40] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:42:47] !log samtar@deploy1003 samtar: Continuing with sync [10:42:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:43:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:44:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79511 and previous config saved to /var/cache/conftool/dbconfig/20250721-104439-root.json [10:45:23] (03PS1) 10Btullis: Add the new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1171173 (https://phabricator.wikimedia.org/T399964) [10:46:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:47:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:47:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:49:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:50:01] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171168|IS/IS-labs: Initial state of wgTemplateDataEnableFeaturedTemplates (T391064)]] (duration: 15m 20s) [10:50:06] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [10:52:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:53:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:53:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:54:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:24] (03PS1) 10Samtar: IS: Set wgTemplateDataEnableFeaturedTemplates default true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171175 (https://phabricator.wikimedia.org/T391064) [10:56:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:57:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:59:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:59:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:00:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:00:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:03:00] (03PS1) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [11:04:51] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 
10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [11:05:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:07:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:07:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:08:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:11:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:12:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:15:10] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:20:07] (03PS2) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [11:20:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:21:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:21:35] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [11:22:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:23:09] (03PS1) 10Marostegui: s4 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171178 [11:23:49] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1171178 (owner: 10Marostegui) [11:24:18] !log move doh7003 (insetup) to ganeti7002 [11:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:25:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:25:24] (03CR) 10Marostegui: [C:03+2] s4 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171178 (owner: 10Marostegui) [11:27:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:27:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:28:55] (03PS1) 10Marostegui: s4 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171179 [11:29:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11020033 (10jcrespo) Disk 13 has appeared: "Online, Spun Up". 
@Jclark-ctr did you do something extra aside from the reboot? (not complaining, just trying to figure out why it is now good!) [11:30:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79512 and previous config saved to /var/cache/conftool/dbconfig/20250721-113004-marostegui.json [11:30:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:10] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:30:43] !log depool and move ncredir7003 to ganeti7003 [11:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:03] (03CR) 10Marostegui: "Noop" [puppet] - 10https://gerrit.wikimedia.org/r/1171179 (owner: 10Marostegui) [11:31:04] (03CR) 10Marostegui: [C:03+2] s4 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171179 (owner: 10Marostegui) [11:31:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:32:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:33:01] (03PS1) 10Marostegui: db1248: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1171180 (https://phabricator.wikimedia.org/T388837) [11:33:50] (03CR) 10Marostegui: [C:03+2] db1248: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1171180 
(https://phabricator.wikimedia.org/T388837) (owner: 10Marostegui) [11:35:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:35:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:36:40] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:37:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11020045 (10Jclark-ctr) 05Open→03Resolved [11:37:46] (03PS1) 10Marostegui: db2236: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1171181 [11:38:40] (03CR) 10Stevemunene: [C:03+1] Add the new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1171173 (https://phabricator.wikimedia.org/T399964) (owner: 10Btullis) [11:39:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11020047 (10jcrespo) For posterity (clarifying what happened), the originally swapped disk was bad, and @Jclark-ctr hot-swapped it Friday afternoon. 
[11:39:39] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1171181 (owner: 10Marostegui) [11:39:40] (03CR) 10Marostegui: [C:03+2] db2236: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1171181 (owner: 10Marostegui) [11:40:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:43:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:45:10] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:45:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P79513 and previous config saved to /var/cache/conftool/dbconfig/20250721-114511-marostegui.json [11:47:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:47:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:50:01] !log depool ncredir7004 for ganeti7002 bird upgrade [11:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:50:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:50:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:52:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11020077 (10Jclark-ctr) a:03Jclark-ctr [11:53:38] (03PS1) 10Cathal Mooney: Update Kubernetes reverse DNS PTR delegations [dns] - 10https://gerrit.wikimedia.org/r/1171182 (https://phabricator.wikimedia.org/T310169) [11:53:49] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for backup1007.eqiad.wmnet [11:53:50] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for backup1007.eqiad.wmnet [11:54:37] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11020083 (10Jclark-ctr) @BTullis we have another failed drive on an-worker. 
Dell ticket is in progress for replacement drive [11:55:54] (03PS1) 10Marostegui: s7 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171183 [11:55:59] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:56:00] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:56:14] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:56:25] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:56:35] (03CR) 10Btullis: [C:03+2] Add the new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1171173 (https://phabricator.wikimedia.org/T399964) (owner: 10Btullis) [11:56:39] (03CR) 10Marostegui: "NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1171183 (owner: 10Marostegui) [11:56:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:56:55] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:56:57] (03CR) 10Marostegui: [C:03+2] s7 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171183 (owner: 10Marostegui) [11:57:18] btullis: good to merge? 
[11:57:53] !log ayounsi@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7004.magru.wmnet [11:58:06] (03PS1) 10Elukey: role::puppetmaster::frontend: remove requestctl client [puppet] - 10https://gerrit.wikimedia.org/r/1171184 [11:59:19] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6347/console" [puppet] - 10https://gerrit.wikimedia.org/r/1171184 (owner: 10Elukey) [11:59:25] (03PS2) 10Cathal Mooney: Update Kubernetes reverse DNS PTR delegations [dns] - 10https://gerrit.wikimedia.org/r/1171182 (https://phabricator.wikimedia.org/T376291) [12:00:09] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:00:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:00:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P79514 and previous config saved to /var/cache/conftool/dbconfig/20250721-120019-marostegui.json [12:02:14] (03PS3) 10Cathal Mooney: Update Kubernetes reverse DNS PTR delegations [dns] - 10https://gerrit.wikimedia.org/r/1171182 (https://phabricator.wikimedia.org/T376291) [12:03:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:03:15] (03PS1) 10Marostegui: s7 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171185 [12:03:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.411s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:08:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:09:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:13:00] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11020134 (10cmooney) Still flapping, have asked for an update. 
[12:14:12] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:14:34] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:15:06] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:15:18] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:15:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79515 and previous config saved to /var/cache/conftool/dbconfig/20250721-121526-marostegui.json [12:15:34] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:15:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:15:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79516 and previous config saved to /var/cache/conftool/dbconfig/20250721-121549-marostegui.json [12:16:51] (03PS3) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [12:18:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.09s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:18:25] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 
(https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [12:19:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:19:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:19:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:19:55] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:20:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:20:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.273s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:24:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:24:05] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:24:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:24:55] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:25:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.273s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:27:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:27:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:28:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:28:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:29:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.338s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:31:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:31:10] (03CR) 10Marostegui: [C:03+2] s7 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171185 (owner: 10Marostegui) [12:32:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:34:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:35:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:39:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:39:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:39:56] !log deploy CR1169662 to test and magru routed ganeti [12:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:41:23] (03CR) 10Ayounsi: [C:03+2] Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [12:42:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:42:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:42:22] (03PS4) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) [12:42:46] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11020181 (10Jhancock.wm) Are we doing these one at a time as well? We can start scheduling them this week. I plan to be onsite every day this week. 
[12:43:49] * Lucas_WMDE looks at the deployment calendar [12:44:05] not sure I feel brave enough to move all those wikis out of the wikipedia dblist 😅 [12:44:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:45:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.632s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:45:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:49:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:51:03] ok, the diffConfig looks encouraging at least [12:52:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:53:38] (CR) Lucas Werkmeister (WMDE): [C:+1] Add a test to verify that "normal" DBLists contain only SUL wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [12:54:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:54:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:55:30] Yeah, I tried to minimize the diffConfig to ease reviewer anxiety ;) [12:56:00] (CR) Lucas Werkmeister (WMDE): [C:+1] "Should be okay to deploy now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: Arthur taylor) [12:56:13] (PS6) Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) [12:56:23] (CR) Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [12:57:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:18] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: Arthur taylor) [12:58:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:59:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:59:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:59:52] that’s quite a bit of logspam in logspam-watch :S [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1300). [13:00:05] Daimona and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:39] o/ [13:01:06] I would say let’s start with “Explicitly set wgServer etc.” separately and include my beta-only change in that [13:01:15] and then the rest of Daimona’s changes… probably also all individually tbh [13:01:20] not sure we’ll get through all of them in the window [13:01:34] but I think it’s safer to not deploy them together [13:02:06] Makes sense!
[13:02:19] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: Jforrester) [13:02:20] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: Arthur taylor) [13:03:29] (PS6) Fabfur: haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [13:03:35] (Merged) jenkins-bot: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: Jforrester) [13:03:38] (Merged) jenkins-bot: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: Arthur taylor) [13:03:45] (CR) Fabfur: haproxy: script to perform configuration validation (6 comments) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [13:03:52] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167900|Explicitly set wgServer etc.
for private wikis under the 'wikipedia' dblist (T183549)]], [[gerrit:1170304|Enable wbui2025 mobile user interface on Wikidata Beta (T399703)]] [13:03:58] (CR) CI reject: [V:-1] haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [13:03:59] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [13:03:59] T399703: [MEX] Release under feature flag on beta wikidata - https://phabricator.wikimedia.org/T399703 [13:05:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:23] (CR) Lucas Werkmeister (WMDE): [C:+1] Enable wbui2025 mobile user interface on Wikidata Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) (owner: Arthur taylor) [13:05:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:05:49] !log lucaswerkmeister-wmde@deploy1003 jforrester, arthurtaylor, lucaswerkmeister-wmde: Backport for [[gerrit:1167900|Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist (T183549)]], [[gerrit:1170304|Enable wbui2025 mobile user interface on Wikidata Beta (T399703)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:06:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:41] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:06:50] Daimona: please test ^^ [13:06:53] (that nothing changed yet, I guess) [13:06:54] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:08:40] I can confirm that my config change appears to have no effect on real and test Wikidata yet (which is the expected behavior) [13:09:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:16] (PS1) Phuedx: mw::maintenance: ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) [13:09:18] Looks good AFAICT [13:10:18] !log lucaswerkmeister-wmde@deploy1003 jforrester, arthurtaylor, lucaswerkmeister-wmde: Continuing with sync [13:10:20] ok, thanks!
[13:10:34] re “that’s quite a bit of logspam in logspam-watch :S” – reported at T400055 [13:10:35] T400055: PHP Warning: Undefined array key "DEFAULT" - https://phabricator.wikimedia.org/T400055 [13:10:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:11:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:12:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:12:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:12:12] (CR) Phuedx: [C:-1] "This'll need to wait until I845f5d8f727f5b2ddfcf4dd7fae256bb1c12ec6d is deployed." [puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx) [13:13:07] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:13:46] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:14:20] * Lucas_WMDE isn’t used to T4* task IDs yet [13:15:43] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167900|Explicitly set wgServer etc.
for private wikis under the 'wikipedia' dblist (T183549)]], [[gerrit:1170304|Enable wbui2025 mobile user interface on Wikidata Beta (T399703)]] (duration: 11m 50s) [13:15:49] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [13:15:49] T399703: [MEX] Release under feature flag on beta wikidata - https://phabricator.wikimedia.org/T399703 [13:16:00] ok [13:16:05] next up: the big ’un [13:16:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11020279 (ssingh) Open→Resolved Please re-open if there are any issues. [13:16:32] :ablobfoxbongoterrified: [13:16:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:17:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11020282 (REsquito-WMF) Thanks @ssingh ! i'll leave feedback once i test it today.
[13:17:20] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:17:50] (PS1) Stevemunene: druid: Add new an-druid100[67] to the cluster [puppet] - https://gerrit.wikimedia.org/r/1171207 (https://phabricator.wikimedia.org/T397440) [13:17:51] (PS1) Stevemunene: zookeeper: Add an-druid100[45] to the cluster [puppet] - https://gerrit.wikimedia.org/r/1171208 (https://phabricator.wikimedia.org/T397440) [13:17:57] (PS1) Stevemunene: turnilo: replace turnilo druid hosts [puppet] - https://gerrit.wikimedia.org/r/1171209 (https://phabricator.wikimedia.org/T397440) [13:18:10] (Merged) jenkins-bot: Move special wikis outside of the 'wikipedia' group [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:18:22] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167880|Move special wikis outside of the 'wikipedia' group (T183549)]] [13:19:04] (PS7) Fabfur: haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [13:19:53] * Lucas_WMDE has a `scap backport --revert` command ready to go in SSH if needed [13:20:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:20:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:20:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:20:20] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1167880|Move special wikis outside of the 'wikipedia' group (T183549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:21:39] Daimona: can you test? (idk if you’re in any special committees ^^) [13:21:47] (also, nice try, that link on your user page almost got me :P) [13:21:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:22:59] (CR) Filippo Giunchedi: [C:+1] prom/metamonitor: add listen_port to public_endpoint vhost template [puppet] - https://gerrit.wikimedia.org/r/1171170 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli) [13:23:05] I can, but I don't really know what to look for, besides obvious things such as "the wikis caught fire" [13:23:11] (what link?)
[13:23:30] (CR) Filippo Giunchedi: [C:+1] prom/metamonitor: force gunicorn to log to a file [puppet] - https://gerrit.wikimedia.org/r/1171171 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli) [13:23:51] the one in “About me” ^^ [13:23:53] (second one) [13:24:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:24:10] yeah I’m not sure what to look for either [13:24:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:25:17] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:25:19] https://arbcom-en.wikipedia.org/wiki/Main_Page doesn’t look particularly on fire, at least [13:25:24] has a working logo too [13:25:32] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:25:50] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11020293 (Marostegui) Yeah, we'll do one at the time, to be on the safe side. Do you want me to get es2035 ready for tomorrow?
[13:26:14] https://arbcom-fi.wikipedia.org/ shows the wikipedia logo, both with and without WikimediaDebug, as per the TODO in the change [13:26:19] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11020296 (Jclark-ctr) Confirmed: Service Request 213125439 [13:26:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:26:57] Oh yeah, but that's a really interesting talk, isn't it? [13:27:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:27:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:27:07] I tested a bunch of stuff on sysop-it and everything looks normal [13:27:21] ok, nice [13:27:26] then let’s try it [13:27:44] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde: Continuing with sync [13:27:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:28:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:30:10] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:30:35] In logstash there's a warning that seemed related at first glance: "This model is not recommended for use in projects outside
of Wikipedia". But there's a high volume of these for the last 15 days at least, so it's not actually related. [13:30:44] ok [13:30:47] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11020304 (Jhancock.wm) yeah that would work for me, ty! [13:31:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:31:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:31:08] Daimona: I’m pondering which change to do next… I’d be inclined to do the phan duplicate array keys one, both because I like it in general and to leave a bit more time where special wiki users can shout if anything went wrong [13:31:10] (CR) Ssingh: [C:+1] "The auto-generated bit really sealed it 😊" [dns] - https://gerrit.wikimedia.org/r/1171182 (https://phabricator.wikimedia.org/T376291) (owner: Cathal Mooney) [13:31:23] then probably add the test for SUL-ness, and then do “clean up some settings” last [13:31:26] how does that sound [13:31:29] *? [13:31:32] Sure!
[13:32:03] and it should be okay if we overrun a little bit, there’s a half-hour break before the xLab window [13:32:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:33:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:33:10] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167880|Move special wikis outside of the 'wikipedia' group (T183549)]] (duration: 14m 48s) [13:33:14] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [13:33:53] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy) [13:34:02] (CR) CI reject: [V:-1] Add phan and use it to detect duplicated array keys [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy) [13:34:04] lol, the right-to-left arrows in the middle of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1167941/2/wmf-config/core-Namespaces.php … [13:34:08] (Also, the warning I previously mentioned comes mostly from wikitech, incubatorwiki, wikifunctionswiki et al) [13:34:42] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:34:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:54] !log gmodena@deploy1003 helmfile [staging] DONE
helmfile.d/services/mw-page-content-change-enrich: apply [13:35:40] oops, merge conflict apparently [13:35:53] Ah yes, that part is written in ꟼHꟼ [13:36:24] :D [13:36:39] Merge conflict makes sense, I noticed a duplicate key while doing the other patch. Lemme fix it quickly. [13:36:43] (PS3) Lucas Werkmeister (WMDE): Add phan and use it to detect duplicated array keys [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy) [13:36:47] already did [13:36:47] (CR) Ssingh: [C:+1] "Looks good -- a PCC output once we get doh7003 will be handy but no issues with testing it manually there as well." [puppet] - https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: Ayounsi) [13:37:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:05] Oh nice, thank you [13:37:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:25] Daimona: https://bash.toolforge.org/quip/jh00LZgBvg159pQrCh5c ^^ [13:37:28] anyway [13:38:10] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy) [13:39:04] (Merged) jenkins-bot: Add phan and use it to detect duplicated array keys [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy) [13:39:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:39:17] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167941|Add phan and use it to detect duplicated array keys]] [13:39:44] Well :) [13:39:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:40:14] (PS1) Ssingh: site.pp: remove doh7003 from insetup [puppet] - https://gerrit.wikimedia.org/r/1171212 (https://phabricator.wikimedia.org/T362392) [13:40:31] (PS2) Ssingh: site.pp: remove doh7003 from insetup [puppet] - https://gerrit.wikimedia.org/r/1171212 (https://phabricator.wikimedia.org/T362392) [13:40:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:41:14] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde:
Backport for [[gerrit:1167941|Add phan and use it to detect duplicated array keys]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:10] https://github.com/lucaswerkmeister/home/blob/main/.bashrc.d/wikimedia-debug-diff reports no effective changes on {bh,fa,ne,ar}.wikipedia.org, yay [13:42:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79517 and previous config saved to /var/cache/conftool/dbconfig/20250721-134227-marostegui.json [13:42:37] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:43:18] Daimona: do you want to test anything else? [13:43:39] (CR) Ayounsi: [C:+1] site.pp: remove doh7003 from insetup [puppet] - https://gerrit.wikimedia.org/r/1171212 (https://phabricator.wikimedia.org/T362392) (owner: Ssingh) [13:43:56] (CR) Ssingh: [C:+2] site.pp: remove doh7003 from insetup [puppet] - https://gerrit.wikimedia.org/r/1171212 (https://phabricator.wikimedia.org/T362392) (owner: Ssingh) [13:44:02] I think we're good [13:44:07] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde: Continuing with sync [13:44:47] (CR) Cathal Mooney: [C:+2] Update Kubernetes reverse DNS PTR delegations [dns] - https://gerrit.wikimedia.org/r/1171182 (https://phabricator.wikimedia.org/T376291) (owner: Cathal Mooney) [13:44:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:45:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:45:50] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm [13:46:17] !log cmooney@dns2005 START - running authdns-update [13:47:04] !log cmooney@dns2005 END - running authdns-update [13:48:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:48:40] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs://wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:48:44] SRE, Infrastructure-Foundations, netops: BGP: Support receipt of graceful-shutdown community and set local-pref - https://phabricator.wikimedia.org/T399931#11020341 (cmooney) Open→Resolved a:cmooney [13:48:45] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs://wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:49:20] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167941|Add phan and use it to detect duplicated array keys]] (duration: 10m 03s) [13:49:24] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS
(hdfs://wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:49:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs://wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:49:36] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:49:41] I wonder if scap will even sync this one [13:49:44] (CR) CI reject: [V:-1] Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:49:51] buh, needs rebase? [13:49:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:49:58] (PS7) Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) [13:50:04] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:50:05] Welp [13:50:25] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:50:40] why o_O [13:50:43] (CR) TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890
(https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:50:51] no actual diff in PS6..7 … [13:50:55] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:51:16] I thought after all these years of gerrit you'd know you're not supposed to ask these questions :D [13:51:31] (Merged) jenkins-bot: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy) [13:51:45] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167890|Add a test to verify that "normal" DBLists contain only SUL wikis (T183549)]] [13:51:53] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [13:52:12] looks like scap is syncing after all, ok [13:52:35] (PS3) Daimona Eaytoy: Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) [13:53:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/ using stat1009.eqiad.wmnet) [13:53:14] Nice :) [13:53:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1167890|Add a test to verify that "normal" DBLists contain only SUL wikis (T183549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:53:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [13:53:49] (https://gerrit.wikimedia.org/g/operations/puppet/+/4793573eca97409db86c07046ed4b34f3e5fadf2/modules/scap/templates/scap.cfg.erb#122 is the config that specifies where syncs are skipped, and it’s only “beta-only”, not “not production”, so tests/ isn’t included ^^) [13:54:24] (03CR) 10Cathal Mooney: "btullis: this approach is easier from an automation point of view for moving forward. it changes the IPv6 BGP neighbor from the global ad" [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [13:54:26] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Continuing with sync [13:54:52] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: add listen_port to public_endpoint vhost template [puppet] - 10https://gerrit.wikimedia.org/r/1171170 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [13:55:02] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: force gunicorn to log to a file [puppet] - 10https://gerrit.wikimedia.org/r/1171171 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [13:55:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:55:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:55:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:57:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:57:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P79518 and previous config saved to /var/cache/conftool/dbconfig/20250721-135735-marostegui.json [13:58:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:58:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:59:46] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167890|Add a test to verify that "normal" DBLists contain only SUL wikis (T183549)]] (duration: 08m 01s) [13:59:51] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [14:00:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:00:42] Thanks Lucas! I'll reschedule the remaining patches for tomorrow. 
[14:00:42] (03PS10) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [14:00:48] ok! [14:00:59] FWIW I’d be fine with continuing deploying now [14:01:04] but if you have to leave or something, that’s also fine ^^ [14:01:10] there’s just one patch left, right? [14:01:16] (at least of the ones scheduled for today) [14:01:32] Oh yeah I'm still here, so you can go ahead if that works for you [14:01:37] (03PS1) 10Stevemunene: Add keytabs for new an-druid100[67] hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440) [14:01:40] jouncebot: nowandnext [14:01:40] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [14:01:41] In 0 hour(s) and 28 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1430) [14:01:45] let’s do it then, should be enough time [14:01:45] Yeah, there's one more patch that I forgot to schedule for today [14:01:54] So I'll need to reschedule that one anyway [14:01:59] ah ok [14:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:02:19] place your bets, merge conflict or not [14:02:26] Nay, I rebased it already [14:03:13] (03Merged) 10jenkins-bot: Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:03:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy) [14:03:26] !log 
lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1168169|Clean up some settings for special wikis no longer in wikipedia group (T183549)]] [14:04:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:04:52] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061 (10phaultfinder) 03NEW [14:05:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:20] (03PS11) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [14:05:23] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1168169|Clean up some settings for special wikis no longer in wikipedia group (T183549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:05:28] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [14:06:13] Daimona: anything to test here? [14:07:12] I’m checking if any Wikidata items even have sitelinks to any of these wikis ^^ [14:07:26] I think just the usual "make sure nothing broke" [14:08:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:06] ok, no sitelinks for any of these wikis. so I believe the wmgWikibaseSiteGroup part is indeed a no-op [14:08:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:47] Everything seems normal AFAICT [14:09:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:09:35] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Continuing with sync [14:09:37] ok, then let’s sync [14:10:06] I tried to see where the wgRightsUrl is used but couldn’t find it (it doesn’t seem to be in the footer – I guess that comes from WikimediaMessages) [14:11:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T398014 [14:11:08] T398014: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T398014 [14:11:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T398014', diff saved to https://phabricator.wikimedia.org/P79519 and previous config saved to /var/cache/conftool/dbconfig/20250721-141126-fceratto.json [14:12:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:12:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P79520 and previous config saved to /var/cache/conftool/dbconfig/20250721-141242-marostegui.json [14:14:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage [14:14:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:15:00] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1168169|Clean up some settings for special wikis no longer in wikipedia group (T183549)]] (duration: 11m 34s) [14:15:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:15:06] T183549: Move non-Wikipedia wikis out of the ‘wikipedia’ dblist and into the ‘special’ dblist - https://phabricator.wikimedia.org/T183549 [14:15:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:15:20] !log UTC afternoon backport+config window done [14:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] I should be around for at least 90 more minutes or so, so feel free to ping me if the changes need reverting [14:16:16] (03CR) 10Btullis: [C:03+1] druid: Add new an-druid100[67] to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1171207 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [14:16:40] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1164390 (https://phabricator.wikimedia.org/T398014) (owner: 
10Gerrit maintenance bot) [14:16:42] (03CR) 10Btullis: [C:03+1] zookeeper: Add an-druid100[45] to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1171208 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [14:17:08] (03CR) 10Btullis: [C:03+1] turnilo: replace turnilo druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/1171209 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [14:17:24] (03CR) 10Ayounsi: cephosd: un-set bird bgp neighbors rather than override for each host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [14:18:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:18:06] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage [14:18:40] !log Starting s1 codfw failover from db2203 to db2212 - T398014 [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:44] T398014: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T398014 [14:18:50] (03CR) 10Vgutierrez: [C:04-1] "* probably these tests don't fit on modules/haproxy and should need to be on modules/profile/files/cache/" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [14:19:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:19:51] (03PS1) 10Marostegui: s8 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171219 [14:19:53] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [14:20:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2212 to s1 primary T398014', diff saved to 
https://phabricator.wikimedia.org/P79521 and previous config saved to /var/cache/conftool/dbconfig/20250721-142011-fceratto.json [14:20:29] (03PS2) 10Marostegui: s8 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171219 [14:20:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:20:43] (03PS3) 10Marostegui: s8 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171219 [14:20:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:21:21] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1171219 (owner: 10Marostegui) [14:21:28] (03CR) 10Vgutierrez: [C:04-1] haproxy: script to perform configuration validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [14:21:33] (03CR) 10Marostegui: [C:03+2] s8 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171219 (owner: 10Marostegui) [14:24:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:27:33] 06SRE, 10vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11020462 (10FCeratto-WMF) [14:27:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79523 and previous config saved to /var/cache/conftool/dbconfig/20250721-142749-marostegui.json [14:27:55] T399249: Add 
cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:28:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:28:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79524 and previous config saved to /var/cache/conftool/dbconfig/20250721-142811-marostegui.json [14:29:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1430) [14:33:56] (03CR) 10Ssingh: "We can look at the alert rules in detail and why the test doesn't align up but IMO, as a first step, you should remove the explicit [WARNI" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [14:34:09] (03PS1) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [14:36:34] (03CR) 10CI reject: [V:04-1] Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [14:36:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [14:36:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T395241)', diff saved to https://phabricator.wikimedia.org/P79525 and previous config saved to /var/cache/conftool/dbconfig/20250721-143658-fceratto.json [14:37:12] (03PS1) 10Marostegui: s8 eqiad: Move to SBR [puppet] - 
10https://gerrit.wikimedia.org/r/1171224 [14:37:23] (03CR) 10Marostegui: "Noop" [puppet] - 10https://gerrit.wikimedia.org/r/1171224 (owner: 10Marostegui) [14:37:50] (03CR) 10Marostegui: [C:03+2] s8 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171224 (owner: 10Marostegui) [14:38:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm [14:39:05] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:40:49] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:42:18] (03PS2) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [14:42:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [14:42:59] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [14:43:45] (03PS1) 10Kimberly Sarabia: New experiment name for page-visited event [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) [14:43:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T395241)', diff saved to https://phabricator.wikimedia.org/P79526 and previous config saved to 
/var/cache/conftool/dbconfig/20250721-144358-fceratto.json [14:44:35] (03CR) 10Ssingh: [C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:44:36] (03CR) 10CI reject: [V:04-1] Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [14:46:38] (03PS12) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [14:46:47] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071 (10herron) 03NEW [14:47:38] (03CR) 10Elukey: [V:03+1 C:03+2] role::puppetmaster::frontend: remove requestctl client [puppet] - 10https://gerrit.wikimedia.org/r/1171184 (owner: 10Elukey) [14:48:56] (03CR) 10Kimberly Sarabia: "This is the cherry pick we want to backport today in order to turn on the experiment tomorrow." [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) (owner: 10Kimberly Sarabia) [14:49:08] (03CR) 10Ssingh: [C:03+1] "I am guessing the facts for doh7003 haven't been updated yet, otherwise it should match facts['netmask']." 
[puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:49:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) (owner: 10Kimberly Sarabia) [14:50:00] (03PS3) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [14:50:40] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill citoid Pyrra Metrics - https://phabricator.wikimedia.org/T400073 (10herron) 03NEW [14:52:05] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [14:52:21] (03CR) 10CI reject: [V:04-1] Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [14:53:32] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:24] (03PS4) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [14:55:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 
10Btullis) [14:59:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:30] (03PS8) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [14:59:41] (03PS5) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [15:01:20] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [15:03:39] (03CR) 10Fabfur: "* Moved to the profile section" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [15:04:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:27] (03PS1) 10Jforrester: Provide a repo-mode pair of parser functions for showing label/description [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171228 [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:17] (03PS13) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 
10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:08:17] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6366/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:08:59] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm [15:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:32] (03PS4) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [15:09:45] (03CR) 10Fabfur: "There's no actual reason why we should keep these, I've started from the existing haproxykafka alert configuration from data-engineering w" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:10:10] 06SRE, 06Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11020759 (10Eevans) Ok, so in an attempt to summarize things: [x] It seems that goal no. 1 is complete, all partman preseeds have been updated [] Goal no. 3 might need to be revisited with... 
[15:10:43] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:11:02] (03PS3) 10Ayounsi: WIP: Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) [15:11:10] (03PS6) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [15:11:16] (03PS4) 10Ayounsi: WIP: Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) [15:11:34] (03PS5) 10Ayounsi: Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) [15:11:59] (03CR) 10JHathaway: [C:03+2] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:11:59] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:49] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:04] (03PS14) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:13:14] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [15:13:45] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6368/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:14:04] (03CR) 10JHathaway: [C:03+2] reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - 10https://gerrit.wikimedia.org/r/1167883 (owner: 10JHathaway) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:58] (03PS15) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:18:41] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6369/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:19:19] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:20:24] (03PS7) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 [15:20:38] (03PS16) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:21:20] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6371/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:22:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (owner: 10Btullis) [15:23:25] (03PS17) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:24:05] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6372/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:25:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170376 (https://phabricator.wikimedia.org/T399755) (owner: 10Jforrester) [15:25:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171228 (owner: 10Jforrester) [15:26:00] 06SRE, 06Release-Engineering-Team, 10Data-Engineering (Q4 2025 April 1st - June 30th): Archiva Mirror Maven Central cache no space left on device - https://phabricator.wikimedia.org/T399679#11020835 (10amastilovic) 05Open→03Resolved [15:26:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:26:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:05] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:53] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [15:28:57] (03PS8) 10Btullis: Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (https://phabricator.wikimedia.org/T397160) [15:29:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1530). 
[15:30:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:30:52] (03Merged) 10jenkins-bot: ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170376 (https://phabricator.wikimedia.org/T399755) (owner: 10Jforrester) [15:31:07] (03Merged) 10jenkins-bot: Provide a repo-mode pair of parser functions for showing label/description [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171228 (owner: 10Jforrester) [15:31:20] (03PS1) 10David Caro: prometheus_node_pinger: don't fail service [puppet] - 10https://gerrit.wikimedia.org/r/1171233 [15:31:23] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1170376|ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too (T399755)]], [[gerrit:1171228|Provide a repo-mode pair of parser functions for showing label/description]] [15:31:27] T399755: TypeError: MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectByTitle(): Argument #1 ($title) must be of type MediaWiki\Title\Title, null given - https://phabricator.wikimedia.org/T399755 [15:31:50] (03CR) 10CI reject: [V:04-1] prometheus_node_pinger: don't fail service [puppet] - 10https://gerrit.wikimedia.org/r/1171233 (owner: 10David Caro) [15:32:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:32:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:32:55] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [15:34:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:32] (03PS2) 10David Caro: prometheus_node_pinger: don't fail service [puppet] - 10https://gerrit.wikimedia.org/r/1171233 [15:34:51] (03PS18) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:35:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:35:37] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6373/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:36:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:37:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:37:10] (03PS19) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:37:52] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6374/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:38:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:38:58] (03CR) 10Elukey: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [15:39:58] (03PS20) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:39:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11020951 (10phaultfinder) [15:40:41] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6375/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:41:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:42:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 
fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:44:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:44:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:45:59] andrew@cumin1003 reimage (PID 2655918) is awaiting input [15:47:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:47:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:47:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:48:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [15:49:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:50:24] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm [15:51:20] (03PS21) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:52:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79527 and previous config saved to /var/cache/conftool/dbconfig/20250721-155421-marostegui.json [15:54:33] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:54:46] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host 
sretest2001.codfw.wmnet with OS bookworm [15:55:33] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1170376|ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too (T399755)]], [[gerrit:1171228|Provide a repo-mode pair of parser functions for showing label/description]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:55:37] T399755: TypeError: MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectByTitle(): Argument #1 ($title) must be of type MediaWiki\Title\Title, null given - https://phabricator.wikimedia.org/T399755 [15:56:26] !log jforrester@deploy1003 jforrester: Continuing with sync [15:56:59] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [15:57:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:57:35] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [15:58:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:59:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:59:54] (03PS1) 10Ayounsi: Routed Ganeti: also permit anycast to be advertised from VMs [homer/public] - 10https://gerrit.wikimedia.org/r/1171236 (https://phabricator.wikimedia.org/T362392) [16:00:14] (03CR) 
10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:00:16] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [16:02:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:03:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:04:28] (03PS22) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [16:04:46] (03CR) 10Ssingh: [C:03+1] Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:05:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:14] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6377/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:06:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:06:37] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:06:57] (03PS23) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [16:07:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:07:42] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6378/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:08:05] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.* [16:08:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:08:24] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: Dell support [16:08:57] !log jforrester@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1170376|ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too (T399755)]], [[gerrit:1171228|Provide a repo-mode pair of parser functions for showing label/description]] (duration: 37m 34s) [16:09:01] T399755: TypeError: MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectByTitle(): Argument #1 ($title) must be of type MediaWiki\Title\Title, null given - https://phabricator.wikimedia.org/T399755 [16:09:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P79528 and previous config saved to /var/cache/conftool/dbconfig/20250721-160930-marostegui.json [16:10:10] !log cumin 'A:cp' 'systemctl reset-failed update-ocsp-all.timer' - T399114 [16:10:11] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:16] T399114: Remove OCSP monitoring and related bits - https://phabricator.wikimedia.org/T399114 [16:11:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:13:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:13:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:34:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:11] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:56] (03CR) 10Cathal Mooney: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:36:20] (03CR) 10Cathal Mooney: cephosd: un-set bird bgp neighbors rather than override for each host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [16:36:22] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [16:37:32] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11021310 (10cmooney) ` 2025-07-21 14:38 Hello Team, We are currently experiencing a major issue affecting your circuit due to instability in our channels betwe... 
[16:39:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:39:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79531 and previous config saved to /var/cache/conftool/dbconfig/20250721-163945-marostegui.json [16:39:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:40:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79532 and previous config saved to /var/cache/conftool/dbconfig/20250721-164008-marostegui.json [16:42:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.58s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:43:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:44:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:47:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.219s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:47:27] (03PS1) 10Matthias Mullie: Add new MediaSearch config/coefficients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171239 (https://phabricator.wikimedia.org/T385286) [16:49:11] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:50:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:51:36] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:51:47] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:52:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [16:54:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:55:44] 10SRE-swift-storage, 10MinT, 10LPL Essential (2025 Jul-Sep), 10LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11021375 (10Dzahn) Yay, thank you @KartikMistry ! :) (this means I should not get warni... [16:57:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 5/6 UP : 5 v2 P2P interfaces vs. 6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:58:09] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:59:34] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: fix nic names in ceph hiera [puppet] - 10https://gerrit.wikimedia.org/r/1171240 [16:59:40] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:59:55] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1700) [17:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1700). 
[17:00:14] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: fix nic names in ceph hiera [puppet] - 10https://gerrit.wikimedia.org/r/1171240 (owner: 10Andrew Bogott) [17:01:01] (03CR) 10FNegri: [C:03+1] prometheus_node_pinger: don't fail service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171233 (owner: 10David Caro) [17:01:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:01:11] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:02:50] (03PS3) 10David Caro: prometheus_node_pinger: don't fail service [puppet] - 10https://gerrit.wikimedia.org/r/1171233 [17:02:53] (03CR) 10David Caro: prometheus_node_pinger: don't fail service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171233 (owner: 10David Caro) [17:04:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:04:11] (03CR) 10David Caro: [C:03+2] prometheus_node_pinger: don't fail service [puppet] - 10https://gerrit.wikimedia.org/r/1171233 (owner: 10David Caro) [17:04:11] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 6/6 UP : 5 v2 P2P interfaces vs. 6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:04:40] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:07:06] 06SRE, 10DNS, 06Traffic: Verify wikimediafoundation.org for Visual Studio Marketplace. 
- https://phabricator.wikimedia.org/T400089 (10Seddon) 03NEW [17:07:11] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [17:09:40] RESOLVED: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:13:27] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T399162 - bking@cumin1002 [17:13:32] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [17:14:22] (03CR) 10Stevemunene: [C:03+1] Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [17:16:09] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:16:11] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:18:30] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [17:19:09] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:19:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:19:46] (03PS1) 10Andrew Bogott: Revert "cloudcephosd2004-dev: fix nic names in ceph hiera" [puppet] - 10https://gerrit.wikimedia.org/r/1171244 [17:20:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:21:03] (03PS1) 10Jcrespo: installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) [17:21:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:21:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:22:04] (03CR) 10Andrew Bogott: [C:03+2] Revert "cloudcephosd2004-dev: fix nic names in ceph hiera" [puppet] - 10https://gerrit.wikimedia.org/r/1171244 (owner: 10Andrew Bogott) [17:22:12] (03PS2) 10Jcrespo: installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) [17:23:02] jouncebot: 
nowandnext [17:23:02] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1700) [17:23:02] For the next 0 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T1700) [17:23:02] In 2 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2000) [17:24:17] !log disabling puppet on A:cp (112 hosts) to deploy gerrit:117941 T274228 [17:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:23] T274228: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 [17:24:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:24:56] (03CR) 10Dzahn: [C:03+2] varnish: create new policy that allows websockets but also caches [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [17:27:01] (03PS1) 10BCornwall: Add TXT verification for visualstudio marketplace [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) [17:27:05] !log deploying varnish change on cp4037 as test host [17:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:27] (03PS1) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [17:27:37] (03PS2) 10BCornwall: Add TXT verification for visualstudio marketplace [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) [17:27:42] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [17:28:08] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:28:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:28:57] (03CR) 10BCornwall: "600 seems a little low but all the others are set to that." [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) (owner: 10BCornwall) [17:29:47] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [17:30:51] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Verify wikimediafoundation.org for Visual Studio Marketplace. 
- https://phabricator.wikimedia.org/T400089#11021515 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03BCornwall [17:31:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:32:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:32:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:35:30] !log recovering database data T399980 [17:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:40] RESOLVED: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:36:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:37:08] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:38:31] (03CR) 10Ssingh: "Looks good, two minor nits:" [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) (owner: 10BCornwall) [17:38:53] andrew@cumin1003 reimage (PID 2684017) is awaiting input [17:39:00] (03PS1) 10BCornwall: ncredir: Funnel pywikipedia.org to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/1171250 (https://phabricator.wikimedia.org/T388809) [17:39:52] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11021555 (10nisrael) @Aklapper my apologies! I will make a note to myself to do this for future tasks! [17:40:08] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:40:40] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:41:55] RESOLVED: [7x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:42:51] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T399162 - bking@cumin1002 [17:42:55] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [17:42:59] (03PS3) 10BCornwall: wikimediafoundation.org: Add VS TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) [17:43:38] (03PS1) 10Clare Ming: InstrumentConfigsFetcher: Make updating configs asynchronous [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171251 (https://phabricator.wikimedia.org/T398422) [17:43:53] (03CR) 10BCornwall: wikimediafoundation.org: Add VS TXT verification (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) (owner: 10BCornwall) [17:44:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:44:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171251 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare 
Ming) [17:45:08] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:45:40] FIRING: [9x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:47:55] RESOLVED: [9x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:50:40] FIRING: [9x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:51:58] (03CR) 10Ssingh: [C:03+1] "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) (owner: 10BCornwall) [17:52:55] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:52:55] (03CR) 10BCornwall: [C:03+2] wikimediafoundation.org: Add VS TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1171247 (https://phabricator.wikimedia.org/T400089) (owner: 10BCornwall) [17:52:58] (03CR) 10Pppery: [C:03+1] ncredir: Funnel pywikipedia.org to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/1171250 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [17:53:26] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [17:53:30] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [17:54:44] !log brett@dns1004 START - running authdns-update [17:55:00] 
10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11021589 (10phaultfinder) [17:55:40] RESOLVED: [8x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:55:41] !log brett@dns1004 END - running authdns-update [17:56:35] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Verify wikimediafoundation.org for Visual Studio Marketplace. - https://phabricator.wikimedia.org/T400089#11021595 (10BCornwall) @Seddon This has been merged. Let me know if it works! [17:57:55] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:00:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:01:45] (03PS1) 10Dzahn: Revert "varnish: create new policy that allows websockets but also caches" [puppet] - 10https://gerrit.wikimedia.org/r/1171253 [18:02:20] (03CR) 10Dzahn: [C:03+2] Revert "varnish: create new policy that allows websockets but also caches" [puppet] - 10https://gerrit.wikimedia.org/r/1171253 (owner: 10Dzahn) [18:05:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:05:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79533 and previous config saved to /var/cache/conftool/dbconfig/20250721-180630-marostegui.json [18:07:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:13:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:13:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:17:17] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [18:17:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:18:02] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [18:18:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:18:14] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:20:45] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cirrussearch2095 [18:20:53] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2095 [18:21:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P79534 and previous config saved to /var/cache/conftool/dbconfig/20250721-182137-marostegui.json [18:22:39] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [18:22:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:22:44] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [18:24:31] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [18:25:01] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2079 [18:25:09] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cirrussearch2079 [18:25:12] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [18:27:05] !log 
bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T399162 - bking@cumin1002 [18:27:40] (03PS2) 10Kimberly Sarabia: xLab: Add instrumentation for logged-out user retention [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) [18:27:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:30:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:30:24] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2078 [18:30:30] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2094 [18:30:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:30:48] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cirrussearch2078 [18:31:02] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2078 [18:33:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:35:10] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [18:36:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:36:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P79535 and previous config saved to /var/cache/conftool/dbconfig/20250721-183645-marostegui.json [18:37:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:38:49] (03PS1) 10Bking: cirrussearch: add missing nodes to CODFW pool [puppet] - 10https://gerrit.wikimedia.org/r/1171258 (https://phabricator.wikimedia.org/T399162) [18:38:57] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [18:40:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:44] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): decommission an-conf100[1-3] - https://phabricator.wikimedia.org/T398013#11021742 (10Jhancock.wm) [18:42:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:42:48] 06SRE, 10DNS, 06Traffic: Verify wikimediafoundation.org for Visual Studio Marketplace. - https://phabricator.wikimedia.org/T400089#11021750 (10Seddon) Thank you! Will keep you in the loop. 
[18:43:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:47:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:49:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:50:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:50:19] (03PS3) 10Bvibber: xLab: Add instrumentation for logged-out user retention [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) (owner: 10Kimberly Sarabia) [18:51:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:51:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:51:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79536 and previous config saved to /var/cache/conftool/dbconfig/20250721-185152-marostegui.json [18:51:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [18:51:57] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T399249)', diff saved to https://phabricator.wikimedia.org/P79537 
and previous config saved to /var/cache/conftool/dbconfig/20250721-185203-marostegui.json [18:52:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:53:04] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [18:54:03] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [18:54:18] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:55:01] (03CR) 10Ssingh: "Thanks for the revert. I wanted to write a longer explanation but the solution is simple:" [puppet] - 10https://gerrit.wikimedia.org/r/1171253 (owner: 10Dzahn) [18:56:27] (03PS1) 10Dzahn: microsites: update recipient email for home dir size warning mails [puppet] - 10https://gerrit.wikimedia.org/r/1171260 (https://phabricator.wikimedia.org/T343364) [18:58:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:59:57] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11021825 (10phaultfinder) [19:01:37] (03CR) 10Dzahn: "example how these emails might look:" [puppet] - 10https://gerrit.wikimedia.org/r/1171260 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [19:03:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:04:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:04:26] (03PS1) 10Dzahn: peopleweb: fix variable name in home dir size warning email [puppet] - 10https://gerrit.wikimedia.org/r/1171261 (https://phabricator.wikimedia.org/T343364) [19:08:53] (03PS2) 10Bking: cirrussearch: add missing nodes to CODFW pool [puppet] - 10https://gerrit.wikimedia.org/r/1171258 (https://phabricator.wikimedia.org/T399162) [19:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:12:34] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [19:12:43] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [19:12:50] 10ops-codfw, 06SRE, 06DC-Ops: Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#11021879 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host cirrussearch2091.codfw.wmnet with OS bullseye [19:18:11] (03CR) 10Dzahn: [C:03+2] peopleweb: fix variable name in home dir size warning email [puppet] - 10https://gerrit.wikimedia.org/r/1171261 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [19:18:19] (03PS2) 10Dzahn: peopleweb: fix 
variable name in home dir size warning email [puppet] - 10https://gerrit.wikimedia.org/r/1171261 (https://phabricator.wikimedia.org/T343364) [19:18:56] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [19:21:22] !log bking@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye [19:21:35] 10ops-codfw, 06SRE, 06DC-Ops: Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#11021898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1002 for host cirrussearch2091.codfw.wmnet with OS bullseye executed with errors:... [19:21:43] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [19:21:55] 10ops-codfw, 06SRE, 06DC-Ops: Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#11021899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host cirrussearch2091.codfw.wmnet with OS bullseye [19:23:26] (03PS1) 10Dzahn: varnish: new policy to allow websockets and caching, apply to phab [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) [19:23:39] (03PS3) 10Bking: cirrussearch: add missing nodes to CODFW pool [puppet] - 10https://gerrit.wikimedia.org/r/1171258 (https://phabricator.wikimedia.org/T399162) [19:26:38] (03CR) 10Dzahn: [V:04-1] "parameter 'req_handling' entry 'phabricator.wikimedia.org' entry 'caching' expects a match for Profile::Cache::Caching" [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:26:38] (03CR) 10Ebernhardson: [C:03+1] "looks to match the nodes connected to the port 9200 cluster (well, plus the 2 out of service nodes)" [puppet] - 
10https://gerrit.wikimedia.org/r/1171258 (https://phabricator.wikimedia.org/T399162) (owner: 10Bking) [19:26:38] 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100 (10Scott_French) 03NEW [19:26:38] 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11021918 (10Scott_French) p:05Triage→03High [19:26:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:27:13] (03PS2) 10Dzahn: varnish: new policy to allow websockets and caching, apply to phab [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) [19:27:54] (03CR) 10Dzahn: [C:03+2] "new change here, as agreed on IRC: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171263" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:27:59] (03CR) 10Dzahn: [C:03+2] "new change here, as agreed on IRC: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171263" [puppet] - 10https://gerrit.wikimedia.org/r/1171253 (owner: 10Dzahn) [19:31:03] (03CR) 10Dzahn: [V:03+1] "this version compiles and shows a diff now: https://puppet-compiler.wmflabs.org/output/1171263/6386/cp7003.magru.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:31:03] (03PS1) 10JHathaway: install_server: fix partman config for ml-serve101[23] [puppet] - 10https://gerrit.wikimedia.org/r/1171264 (https://phabricator.wikimedia.org/T393948) [19:31:03] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171264 (https://phabricator.wikimedia.org/T393948) (owner: 10JHathaway) [19:31:21] re: jinxer-wm: yes, puppet is still failing for basically the entire 
analytics cluster.. since quite some time. reported multiple times on IRC to no avail. [19:31:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:32:45] (03CR) 10Dzahn: [C:03+2] peopleweb: fix variable name in home dir size warning email [puppet] - 10https://gerrit.wikimedia.org/r/1171261 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [19:32:52] (03CR) 10JHathaway: [C:03+2] install_server: fix partman config for ml-serve101[23] [puppet] - 10https://gerrit.wikimedia.org/r/1171264 (https://phabricator.wikimedia.org/T393948) (owner: 10JHathaway) [19:34:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:34:27] !log bking@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye [19:34:32] 10ops-codfw, 06SRE, 06DC-Ops: Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#11021967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1002 for host cirrussearch2091.codfw.wmnet with OS bullseye executed with errors:... 
[19:35:18] mutante: most of the failures seem related to starting hadoop-yarn-nodemanager [19:35:50] yea [19:36:58] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [19:37:24] jhathaway mutante based on https://wikimedia.slack.com/archives/C055QGPTC69/p1753098712716759 , this puppet patch ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171222 ) might be the solution. looking... [19:37:50] ah, it has a couple of +1's already, I'm gonna go ahead and merge [19:37:54] (03CR) 10Bking: [C:03+2] Bigtop: Move the excluded_hosts values into profile::hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/1171222 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [19:39:04] thanks inflatador [19:39:52] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11021998 (10phaultfinder) [19:40:43] np, running puppet against the hosts listed on the patch now [19:44:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:46:49] (03CR) 10Bking: [C:03+2] cirrussearch: add missing nodes to CODFW pool [puppet] - 10https://gerrit.wikimedia.org/r/1171258 (https://phabricator.wikimedia.org/T399162) (owner: 10Bking) [19:49:04] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2064.codfw.wmnet|cirrussearch2073.codfw.wmnet|cirrussearch2078.codfw.wmnet|cirrussearch2094.codfw.wmnet|cirrussearch2095.codfw.wmnet|cirrussearch2096.codfw.wmnet|cirrussearch2110.codfw.wmnet [19:49:25] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster 
search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [19:49:26] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [19:49:29] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:53:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11022038 (10VRiley-WMF) [19:55:18] jhathaway mutante bad news on that patch, it doesn't look like it worked. I ACKed the alert but I'm not sure I'll have time to fix it today. Ben and Steve are aware of the issue, but please do ping us in the Slack thread if this isn't fixed within 24 hours [19:55:31] (or just ping me directly ;) ) [19:56:02] thanks for the update inflatador [19:56:55] thanks inflatador [19:58:56] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [19:58:58] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [19:59:01] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:59:34] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2000). 
[20:00:05] kimberly_sarabia and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:00:29] i'll be deploying kimberly's patch [20:00:54] bvibber: sounds good ! will you ping me when you're done? [20:00:58] sure! [20:01:01] Would there be room for a last-minute backport? Not terribly urgent so it's fine if not. Master patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1169691 still merging. [20:01:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) (owner: 10Kimberly Sarabia) [20:01:26] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [20:01:35] I was going to add it to the calendar, but then realized the backport window is now, not in 1 hour [20:01:44] Daimona: sure - happy to deploy - can you add to the deployment calendar? [20:02:27] (03Merged) 10jenkins-bot: xLab: Add instrumentation for logged-out user retention [extensions/WikimediaEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171226 (https://phabricator.wikimedia.org/T399227) (owner: 10Kimberly Sarabia) [20:02:29] Thanks! Once the master patch merges I'll cherry-pick it and add it to the calendar. Hopefully it passes CI on the first attempt.. 
[20:02:43] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1171226|xLab: Add instrumentation for logged-out user retention (T399227)]] [20:02:47] i love CI [20:02:48] T399227: [Epic] Perform an A/A test for retention baseline - https://phabricator.wikimedia.org/T399227 [20:02:58] we have to wait for it but "back in my day" it was much easier to break production haha [20:04:05] dont https://en.wikipedia.org/wiki/Jinx it [20:04:40] hehe [20:04:53] !log bvibber@deploy1003 ksarabia, bvibber: Backport for [[gerrit:1171226|xLab: Add instrumentation for logged-out user retention (T399227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:05:08] kimberly_sarabia: is there anything you can test on the debug servers? or let it continuel [20:05:11] *continue [20:05:39] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [20:06:03] No, not really, we should be able to just continue just because of the super low traffic [20:06:06] ok :D [20:06:13] !log bvibber@deploy1003 ksarabia, bvibber: Continuing with sync [20:06:15] Only going to be on testwiki for now [20:09:19] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [20:10:03] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [20:10:08] ok it's going out to the production servers, just a few more mins :D [20:10:32] ty [20:11:19] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [20:13:36] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [20:13:37] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171226|xLab: Add instrumentation for logged-out user retention (T399227)]] 
(duration: 10m 54s) [20:13:45] T399227: [Epic] Perform an A/A test for retention baseline - https://phabricator.wikimedia.org/T399227 [20:13:51] kimberly_sarabia: complete! [20:14:00] cjming: all yours [20:14:04] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [20:14:06] ty! [20:14:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.161s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:14:20] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [20:14:32] bvibber: thanks [20:14:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171251 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [20:15:59] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [20:16:16] 06SRE, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022093 (10Pppery) [20:16:32] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? 
- https://phabricator.wikimedia.org/T400018#11022094 (10Pppery) [20:17:25] (03Merged) 10jenkins-bot: InstrumentConfigsFetcher: Make updating configs asynchronous [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171251 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [20:17:41] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]] [20:17:45] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422 [20:19:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.161s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:19:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T399249)', diff saved to https://phabricator.wikimedia.org/P79539 and previous config saved to /var/cache/conftool/dbconfig/20250721-201939-marostegui.json [20:19:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:24:21] (03PS1) 10Daimona Eaytoy: Modifications to UpdateCountriesScript [extensions/CampaignEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171268 (https://phabricator.wikimedia.org/T397270) [20:24:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CampaignEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171268 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [20:24:52] 
There we go. CI has been kind. [20:26:09] (03PS1) 10Bking: cirrussearch: move cirrussearch20(89|91) to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1171270 (https://phabricator.wikimedia.org/T400099) [20:27:30] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.* [20:29:09] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2072:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:13] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [20:30:26] (03PS1) 10Cwhite: beta-logs: provision logging-logstash-04 [puppet] - 10https://gerrit.wikimedia.org/r/1171272 (https://phabricator.wikimedia.org/T353912) [20:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:33:31] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [20:33:32] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022145 (10ssingh) Thanks for sharing; we are looking into it. 
[20:34:46] (03CR) 10Cwhite: [C:03+2] beta-logs: provision logging-logstash-04 [puppet] - 10https://gerrit.wikimedia.org/r/1171272 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [20:34:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P79540 and previous config saved to /var/cache/conftool/dbconfig/20250721-203446-marostegui.json [20:35:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:41:41] !log cjming@deploy1003 cjming: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:41:46] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422 [20:42:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:49:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P79541 and previous config saved to /var/cache/conftool/dbconfig/20250721-204954-marostegui.json [20:51:24] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [20:51:37] !log cjming@deploy1003 cjming: Continuing with sync [20:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [20:53:25] (03PS2) 10Eevans: Add data-gateway listener to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) [20:55:03] (03CR) 10Ebernhardson: [C:03+1] "confirm these hosts are out of rotation" [puppet] - 10https://gerrit.wikimedia.org/r/1171270 (https://phabricator.wikimedia.org/T400099) (owner: 10Bking) [20:55:27] Daimona: i'm sorry - i have to run here soon - is it ok to push to a later window? [20:55:41] Yup, totally! [20:55:49] ty [20:55:59] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [20:56:10] Thank you anyway! [20:59:09] RESOLVED: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2072:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2100). [21:00:32] (03PS1) 10Andrew Bogott: cloudceph codfw1: begin upgrade to Quincy [puppet] - 10https://gerrit.wikimedia.org/r/1171274 [21:01:03] about to do a scap deploy for the security deploy window [21:01:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171274 (owner: 10Andrew Bogott) [21:01:32] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? 
- https://phabricator.wikimedia.org/T400018#11022178 (10Scott_French) @Audiodude - At least at the version of `mwclient` that you appear to be using (0.9.3 per your [[ https://github.com/openzim/wp1/blob/main/Pipfile |... [21:04:06] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]] (duration: 46m 25s) [21:04:11] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422 [21:04:12] !log end of UTC late backport window [21:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T399249)', diff saved to https://phabricator.wikimedia.org/P79542 and previous config saved to /var/cache/conftool/dbconfig/20250721-210501-marostegui.json [21:05:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [21:05:07] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:05:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:05:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T399249)', diff saved to https://phabricator.wikimedia.org/P79544 and previous config saved to /var/cache/conftool/dbconfig/20250721-210531-marostegui.json [21:05:49] (03CR) 10Bking: [C:03+2] cirrussearch: move cirrussearch20(89|91) to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1171270 (https://phabricator.wikimedia.org/T400099) (owner: 10Bking) [21:07:02] (03PS2) 10Andrew Bogott: cloudceph codfw1: begin upgrade to Quincy [puppet] - 10https://gerrit.wikimedia.org/r/1171274 [21:07:33] (03CR) 10Andrew 
Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171274 (owner: 10Andrew Bogott) [21:09:09] FIRING: [3x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2072:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:03] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11022198 (10bking) a:05bking→03None [21:13:07] (03PS1) 10JHathaway: WIP: uefi hack [cookbooks] - 10https://gerrit.wikimedia.org/r/1171278 [21:13:13] (03PS3) 10Andrew Bogott: cloudceph codfw1: begin upgrade to Quincy [puppet] - 10https://gerrit.wikimedia.org/r/1171274 [21:13:20] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [21:14:14] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [21:14:28] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1012.eqiad.wmnet with OS bookworm [21:15:07] (03CR) 10Andrew Bogott: [C:03+2] cloudceph codfw1: begin upgrade to Quincy [puppet] - 10https://gerrit.wikimedia.org/r/1171274 (owner: 10Andrew Bogott) [21:15:24] (03PS1) 10Andrew Bogott: aptrepo: support ceph/quincy on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1171279 [21:18:40] Hey, Dreamy_Jazz, we should be scapping out the patch for T399627 soon. 
[21:18:49] Thanks for the heads up [21:19:46] (it’s actually going out rn) [21:20:58] (03PS2) 10JHathaway: WIP: uefi hack [cookbooks] - 10https://gerrit.wikimedia.org/r/1171278 [21:21:43] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [21:21:57] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1012.eqiad.wmnet with OS bookworm [21:24:26] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022205 (10Audiodude) Thanks for taking the time and looking into what mwclient version we use in production. Upgrading to 0.11.0 was the first thing I did when attempting t... [21:25:50] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [21:25:56] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [21:28:47] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [21:29:01] !log deploy security patch for T399627 [21:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:09] FIRING: [3x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:30] starting second and final scap deployment for security [21:31:51] Dreamy_Jazz: feel free to test T399627 now [21:32:08] Thanks. Testing it. 
[21:32:19] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cirrussearch2061 [21:32:25] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2061 [21:32:45] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022221 (10Scott_French) Thanks for the follow-up @Audiodude. So, from a quick look at 0.11.0 it looks like [[ https://github.com/mwclient/mwclient/blob/v0.11.0/mwclient/cli... [21:34:10] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [21:34:55] (03PS4) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:35:00] sbassett: Unfortunately it seems the fix hasn't worked [21:35:56] (03PS5) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:36:01] Is there a chance that any server could be out of date? [21:36:05] I presume not [21:36:37] I don't think so [21:37:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:37:48] hmm, that patch is definitely on wmf.10: d4be39c5f4 [21:38:11] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2061* [21:38:13] maybe the patch just didn’t completely solve the problem? we can pull it and re-deploy if necessary. 
[21:38:20] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cirrussearch2061.* [21:38:29] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cirrussearch2061.* [21:38:42] I'm thinking it didn't fully fix it. I can try testing this locally again. [21:39:09] FIRING: [3x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:13] !log deploy securty fix for T399662 [21:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:22] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [21:39:27] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [21:39:30] trying that again due to mis-spelling [21:39:51] !log deploy security fix for T399662 [21:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:41] Dreamy_Jazz: if it doesn’t look like it fully solves the issue, we should pull the patch, re-deploy and follow up on the bug. [21:42:03] Yeah. 
I don't think it actually solves the issue at all [21:42:14] I'll try it locally but in the interim it should be pulled [21:42:31] okay I'll go ahead and pull that now [21:43:14] ok, tx, maryum and Dreamy_Jazz [21:44:35] running scap again to undeploy security fix [21:45:12] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:48:01] (03PS2) 10Andrew Bogott: aptrepo: support ceph/quincy on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1171279 [21:48:01] (03PS1) 10Andrew Bogott: cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171284 [21:48:39] (03CR) 10CI reject: [V:04-1] cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171284 (owner: 10Andrew Bogott) [21:49:21] (03PS3) 10Andrew Bogott: aptrepo: support ceph/quincy on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1171279 [21:49:21] (03PS1) 10Andrew Bogott: cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171285 [21:49:30] (03Abandoned) 10Andrew Bogott: cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171284 (owner: 10Andrew Bogott) [21:50:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [21:52:29] (03PS1) 10JHathaway: install_server: fix nvme config again, off by one [puppet] - 10https://gerrit.wikimedia.org/r/1171287 (https://phabricator.wikimedia.org/T393948) [21:52:32] !log undeploy security fix for T399627 [21:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:45] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [21:53:24] (03PS2) 10JHathaway: 
install_server: fix nvme config, again, off by one [puppet] - 10https://gerrit.wikimedia.org/r/1171287 (https://phabricator.wikimedia.org/T393948) [21:53:30] (03PS2) 10Andrew Bogott: cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171285 [21:53:30] (03PS4) 10Andrew Bogott: aptrepo: support ceph/quincy on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1171279 [21:53:37] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [21:54:14] RESOLVED: [3x] SystemdUnitFailed: opensearch-disable-readahead-production-search-codfw.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:57:03] (03CR) 10JHathaway: [C:03+2] install_server: fix nvme config, again, off by one [puppet] - 10https://gerrit.wikimedia.org/r/1171287 (https://phabricator.wikimedia.org/T393948) (owner: 10JHathaway) [22:02:08] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [22:02:13] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [22:05:55] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [22:09:57] sbasset: I may have a fix for the security patch for T399627 [22:10:08] sbassett: [22:10:37] jouncebot: nowandnext [22:10:37] For the next 0 hour(s) and 49 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2100) [22:10:37] In 0 hour(s) and 49 minute(s): Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2300) [22:12:04] I'll get it ready for tomorrow and see about deploying it then [22:13:47] (03CR) 10BCornwall: "FWIW the name conflict is present in the varnish test suite as well, so this wasn't introducing anything new 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [22:19:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11022309 (10jhathaway) @elukey, I was able to get ml-serve1012 to install the base os, after fixing the raid config, https://gerrit.wikimedia.org/... [22:22:48] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [22:22:53] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [22:24:50] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [22:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T399249)', diff saved to https://phabricator.wikimedia.org/P79546 and previous config saved to /var/cache/conftool/dbconfig/20250721-222855-marostegui.json [22:29:00] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:44:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P79547 and previous config saved to /var/cache/conftool/dbconfig/20250721-224402-marostegui.json [22:49:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:59:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P79548 and previous config saved to /var/cache/conftool/dbconfig/20250721-225910-marostegui.json [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2300) [23:00:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T399249)', diff saved to 
https://phabricator.wikimedia.org/P79549 and previous config saved to /var/cache/conftool/dbconfig/20250721-231417-marostegui.json [23:14:24] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:14:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [23:18:24] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022400 (10Audiodude) Thanks for that. I think you might have incorrect assumptions about the wp1 code, though. We do not attempt to set any custom "WP 1.0 Bot" user agent,... [23:21:00] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022401 (10Audiodude) Oh nevermind, we do use a connection pool in order to re-use the login cookies! So it seems according to your analysis (which I just confirmed), becaus... 
[23:21:29] jouncebot: nowandnext [23:21:29] For the next 0 hour(s) and 38 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250721T2300) [23:21:29] In 2 hour(s) and 38 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0200) [23:22:50] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on remaining large s2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [23:23:43] (03Merged) 10jenkins-bot: Set categorylinks to read new on remaining large s2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [23:24:23] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1170371|Set categorylinks to read new on remaining large s2 wikis (T397912)]] [23:24:28] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [23:26:25] !log zabe@deploy1003 zabe: Backport for [[gerrit:1170371|Set categorylinks to read new on remaining large s2 wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:27:12] !log zabe@deploy1003 zabe: Continuing with sync [23:32:46] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170371|Set categorylinks to read new on remaining large s2 wikis (T397912)]] (duration: 08m 23s) [23:32:51] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [23:36:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.704s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1171302 [23:38:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1171302 (owner: 10TrainBranchBot) [23:41:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.401s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:45:03] (03PS1) 10Zabe: Set categorylinks to read new on remaining s2 and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 [23:45:51] (03CR) 10CI reject: [V:04-1] Set categorylinks to read new on remaining s2 and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 (owner: 10Zabe) [23:46:09] (03PS2) 10Zabe: Set categorylinks to read new on remaining s2 and s3 wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 [23:47:29] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on remaining s2 and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 (owner: 10Zabe) [23:48:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 (owner: 10Zabe) [23:48:20] (03Merged) 10jenkins-bot: Set categorylinks to read new on remaining s2 and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171303 (owner: 10Zabe) [23:48:33] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1171303|Set categorylinks to read new on remaining s2 and s3 wikis]] [23:50:33] !log zabe@deploy1003 zabe: Backport for [[gerrit:1171303|Set categorylinks to read new on remaining s2 and s3 wikis]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:51:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:25] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:26] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:26] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:26] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:37] !log zabe@deploy1003 zabe: Continuing with sync [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:42] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:42] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1171302 (owner: TrainBranchBot) [23:51:43] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:43] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:44] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:44] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:58] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:51:58] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:52:02] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:52:12] ^^ looking at it [23:52:18] checking [23:53:27] ryankemper it's exactly the same problem with quorum we had a couple of weeks ago, I'm gonna depool eqiad again while we work on it. 
One sec [23:53:46] inflatador: yeah was about to say the same, requests not getting through [23:53:53] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [23:54:14] FIRING: SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:55:08] inflatador: what'd we do last time, just restart the masters? [23:55:43] ryankemper https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Cluster_Quorum_Loss_Recovery_Procedure ... basically, find the last master that's been restarted and stop it [23:56:52] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171303|Set categorylinks to read new on remaining s2 and s3 wikis]] (duration: 08m 19s) [23:59:34] (PS1) Ryan Kemper: cirrus: fix typo [puppet] - https://gerrit.wikimedia.org/r/1171307 [23:59:43] SRE, Traffic, affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022420 (Scott_French) Thanks for taking a closer look. Indeed, what's likely happening is that, in the absence of User-Agent being explicitly set, the `requests` library...