[00:08:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126
[00:08:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126 (owner: TrainBranchBot)
[00:27:33] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126 (owner: TrainBranchBot)
[00:34:09] FIRING: SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:44:09] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:41] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/e4ff9c7720fba1048fff41692f396062e64759d90d363518465c4b72e4daaf8a/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:59:09] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:01:51] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins
packages - ryankemper@cumin1002 - T397227
[01:01:56] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[01:06:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:14:09] RESOLVED: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:28:11] !log [Cirrus] `ryankemper@cirrussearch2071:~$ sudo systemctl restart opensearch-disable-readahead-production-search-psi-codfw.service`
[01:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate
thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:19] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:35:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:56:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:59:05] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:59:11] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:00:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:01:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:02:19] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:03:21] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 148564 MB (3% inode=99%): /var/lib/hadoop/data/g 147180 MB (3% inode=99%): /var/lib/hadoop/data/j 147069 MB (3% inode=99%): /var/lib/hadoop/data/c 141308 MB (3% inode=99%): /var/lib/hadoop/data/b 146487 MB (3% inode=99%): /var/lib/hadoop/data/l 146676 MB (3% inode=99%): /var/lib/hadoop/data/k 146423 MB (3% inode=99%): /var/lib/hadoop/data
[04:03:21] 1 MB (4% inode=99%): /var/lib/hadoop/data/i 144202 MB (3% inode=99%): /var/lib/hadoop/data/m 153542 MB (4% inode=99%): /var/lib/hadoop/data/d 153568 MB (4% inode=99%): /var/lib/hadoop/data/h 149908 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[05:28:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:32:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79785 and previous config saved to /var/cache/conftool/dbconfig/20250724-053236-root.json
[05:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79786 and previous config saved to /var/cache/conftool/dbconfig/20250724-054743-root.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0600)
[06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0600).
[06:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79787 and previous config saved to /var/cache/conftool/dbconfig/20250724-060249-root.json
[06:05:51] (PS1) Marostegui: mariadb: Productionze es1049-es1057 [puppet] - https://gerrit.wikimedia.org/r/1172182 (https://phabricator.wikimedia.org/T400198)
[06:17:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79788 and previous config saved to /var/cache/conftool/dbconfig/20250724-061755-root.json
[06:33:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79789 and previous config saved to /var/cache/conftool/dbconfig/20250724-063300-root.json
[06:47:44] (CR) Marostegui: [C:+2] mariadb: Productionze es1049-es1057 [puppet] - https://gerrit.wikimedia.org/r/1172182 (https://phabricator.wikimedia.org/T400198) (owner: Marostegui)
[06:50:24] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install
es1049-es1057 - https://phabricator.wikimedia.org/T400198#11029953 (Marostegui)
[06:50:33] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11029954 (Marostegui) Patches done
[06:51:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:52:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79790 and previous config saved to /var/cache/conftool/dbconfig/20250724-065222-marostegui.json
[06:52:28] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:57:51] (PS1) Marostegui: mariadb: Productionze es2049-es2057 [puppet] - https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195)
[07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:27] (CR) Marostegui: [C:+2] mariadb: Productionze es2049-es2057 [puppet] - https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195) (owner: Marostegui)
[07:01:13] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11029965 (Marostegui) Patches done
[07:02:00] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11029966 (Marostegui)
[07:03:21] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 153521 MB (4% inode=99%): /var/lib/hadoop/data/g 156826 MB (4% inode=99%): /var/lib/hadoop/data/j 156367 MB (4% inode=99%): /var/lib/hadoop/data/c 149979 MB (3% inode=99%): /var/lib/hadoop/data/b 155594 MB (4% inode=99%): /var/lib/hadoop/data/l 160933 MB (4% inode=99%): /var/lib/hadoop/data/k 157103 MB (4% inode=99%): /var/lib/hadoop/data
[07:03:21] 1 MB (4% inode=99%): /var/lib/hadoop/data/i 157691 MB (4% inode=99%): /var/lib/hadoop/data/m 155571 MB (4% inode=99%): /var/lib/hadoop/data/d 159581 MB (4% inode=99%): /var/lib/hadoop/data/h 159658 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[07:07:23] (PS1) Marostegui: db1227: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172195 (https://phabricator.wikimedia.org/T399955)
[07:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:36] (CR) Marostegui: [C:+2] db1227: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172195
(https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[07:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[07:13:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1227 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79791 and previous config saved to /var/cache/conftool/dbconfig/20250724-071300-marostegui.json
[07:20:13] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11029993 (Marostegui) >>! In T399927#11028165, @Jhancock.wm wrote: > @Marostegui lemme know when you want to do es2036 I can have it ready today if'd like
[07:20:59] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11029994 (Marostegui)
[07:21:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79792 and previous config saved to /var/cache/conftool/dbconfig/20250724-072100-root.json
[07:36:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79793 and previous config saved to /var/cache/conftool/dbconfig/20250724-073606-root.json
[07:36:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79794 and previous config saved to /var/cache/conftool/dbconfig/20250724-073628-marostegui.json
[07:36:34] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:41:25] (CR) Vgutierrez: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[07:51:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79795 and previous config saved to /var/cache/conftool/dbconfig/20250724-075112-root.json
[07:51:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P79796 and previous config saved to /var/cache/conftool/dbconfig/20250724-075135-marostegui.json
[07:54:52] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1171147 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[07:55:35] (CR) Ilias Sarantopoulos: "It seems that something is not working right.
If you check Jenkins there seems to be no diffs found https://integration.wikimedia.org/ci/j" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[08:03:06] (PS8) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:04:41] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:06:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79797 and previous config saved to /var/cache/conftool/dbconfig/20250724-080617-root.json
[08:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P79798 and previous config saved to /var/cache/conftool/dbconfig/20250724-080643-marostegui.json
[08:10:04] (Abandoned) Bartosz Wójtowicz: ml-services: Update image version for revertrisk models on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1171147 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:14:24] (CR) Elukey: "Thanks for the review Jesse!" [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[08:15:42] SRE, vm-requests, Patch-For-Review: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11030139 (FCeratto-WMF) I opened a puppet CR with the following setup: ` db-test1000 eqiad primary master db-test1001 db-test1002 db-test2001 codfw dc-mas...
[08:16:28] (PS9) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:17:37] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:20:32] (CR) Volans: "replied inline" [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[08:21:06] (CR) Stevemunene: [C:+2] zookeeper: Add an-druid100[45] to the cluster [puppet] - https://gerrit.wikimedia.org/r/1171208 (https://phabricator.wikimedia.org/T397440) (owner: Stevemunene)
[08:21:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79799 and previous config saved to /var/cache/conftool/dbconfig/20250724-082150-marostegui.json
[08:21:56] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:22:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:22:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79800 and previous config saved to /var/cache/conftool/dbconfig/20250724-082213-marostegui.json
[08:22:21] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11030150 (elukey)
[08:23:02] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11030153 (elukey)
[08:23:50] (PS10) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059
(https://phabricator.wikimedia.org/T400039)
[08:25:43] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:28:41] (PS11) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:30:07] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:33:15] SRE-SLO, Observability-Metrics, Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11030188 (fgiunchedi) In that case yes it seems an ad-hoc prometheus instance to run compaction on blocks might be viable, cfr https://github....
[08:36:13] (CR) Fabfur: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:40:50] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11030247 (elukey) Created https://wikitech.wikimedia.org/wiki/Maps/v2/Common_tasks#Warm_up_the_Tegola_tiles_cache_from_scratch
[08:42:50] (PS12) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:44:00] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:56:04] (CR) Ilias Sarantopoulos: "for the rps metric it could be because the previous annotations are there.
You could try the following 2 options:" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[09:03:33] (PS3) Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259)
[09:04:51] (CR) Vgutierrez: "Done" [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:05:11] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:08:13] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:08:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:09:41] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:10:04] (PS1) Effie Mouzeli: prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264
[09:10:31] (CR) CI reject: [V:-1] prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264 (owner: Effie Mouzeli)
[09:11:13] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:31] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:11:37] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:12:35] (PS2) Effie
Mouzeli: prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264
[09:12:52] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs1013.eqiad.wmnet} and A:liberica (T400259)
[09:12:57] T400259: Stop using lvs1013 as a liberica canary - https://phabricator.wikimedia.org/T400259
[09:13:12] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs1013.eqiad.wmnet} and A:liberica (T400259)
[09:15:55] (CR) Vgutierrez: [C:+2] site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:18:05] (PS1) Elukey: redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365)
[09:19:21] (CR) Effie Mouzeli: [C:+2] prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264 (owner: Effie Mouzeli)
[09:22:26] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[09:25:40] (PS13) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:26:55] (CR) CI reject: [V:-1] redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: Elukey)
[09:27:07] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:29:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:30:15] that's expected, no more liberica instances in eqiad at the moment
[09:31:46] (CR) Effie Mouzeli: "Thank you for this Lucas! We had attempted in the past to enable coredumps properly, and run into issues such as servers becoming unrespon" [puppet] - https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: Lucas Werkmeister (WMDE))
[09:31:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79801 and previous config saved to /var/cache/conftool/dbconfig/20250724-093158-marostegui.json
[09:32:04] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:34:09] Puppet, SRE, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11030334 (jijiki) (cp from gerrit comment) We had attempted in the past to enabl...
[09:34:28] (CR) Clément Goubert: "Yeah, the linked task is abandoned as well."
[deployment-charts] - https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: Clément Goubert)
[09:34:40] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11030335 (jijiki) p:Triage→Low
[09:34:40] (Abandoned) Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: Clément Goubert)
[09:36:21] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[09:36:58] (PS14) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:37:08] !log disable BGP for lvs1013 on lsw1-e1-eqiad.mgmt.eqiad.wmnet - T400259
[09:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:13] T400259: Stop using lvs1013 as a liberica canary - https://phabricator.wikimedia.org/T400259
[09:38:08] (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:39:12] (PS15) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:40:03] SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11030353 (Tgr) See T392633#10776362 for a full list of session tokens. We plan to treat everything other than OAuth 2 and session cookies as anonymous for rate limiting purposes, so I imagine you don't care about validat...
[09:41:28] (CR) Fabfur: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:42:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[09:44:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:47:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P79803 and previous config saved to /var/cache/conftool/dbconfig/20250724-094706-marostegui.json
[09:53:14] jouncebot: nowandnext
[09:53:14] No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[09:53:15] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1000)
[09:54:00] (CR) Vgutierrez: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:54:10] !log hnowlan@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:55:22] !log hnowlan@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:56:25] (CR) Clément Goubert: [C:+2] wmnet: Remove maintenance.eqiad.wmnet record [dns] - https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017) (owner: Clément Goubert)
[09:57:14] !log cgoubert@dns1004 START - running authdns-update
[09:57:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bookworm
[09:57:41] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[09:57:43] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[09:58:07] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[09:58:11] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[09:58:13] !log cgoubert@dns1004 END - running authdns-update
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1000)
[10:01:36] (PS16) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[10:01:54] (CR) Fabfur: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[10:02:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P79804 and previous config saved to /var/cache/conftool/dbconfig/20250724-100213-marostegui.json
[10:05:11] (CR) AikoChou: ml-services: update RRLA and RRML images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: Kevin Bazira)
[10:06:21] SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from
pacific->quincy - https://phabricator.wikimedia.org/T400334 (10dcaro) 03NEW [10:09:00] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030433 (10dcaro) [10:12:20] (03PS2) 10Elukey: redfish: simplify change_user_password for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) [10:14:27] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172273 [10:17:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79805 and previous config saved to /var/cache/conftool/dbconfig/20250724-101721-marostegui.json [10:17:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:17:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:19:52] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030440 (10dcaro) [10:24:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2005:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [10:24:42] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:31:21] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030444 (10dcaro) Doing this upgrade, the mons crashed, the error they shown was about using an old mon... [10:34:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2005:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [10:34:42] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:37:18] (03CR) 10Volans: [C:03+1] "LGTM, the code could potentially benefit from some refactoring at this point, not a blocker." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [10:41:19] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [10:43:08] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030459 (10dcaro) p:05Triage→03High [10:44:42] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:46:40] (03CR) 10Fabfur: [C:03+2] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [10:47:43] (03CR) 10Volans: Capirca: handle script having no 'status' attribute gracefully (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [10:49:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:49:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79806 and previous config saved to /var/cache/conftool/dbconfig/20250724-104938-fceratto.json [10:49:43] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:52:56] (03CR) 10Kevin Bazira: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 
(https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [10:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:58:00] (03PS1) 10Phuedx: MetricsPlatform: Disable synchronous configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) [10:58:09] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030499 (10dcaro) The client on 2004 keeps getting connection refused: ` 148677 connect(12, {sa_family=... [11:04:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79807 and previous config saved to /var/cache/conftool/dbconfig/20250724-110412-fceratto.json [11:04:17] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:06:46] (03PS8) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:07:52] (03CR) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [11:08:22] (03PS9) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:08:56] (03PS10) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:09:03] (03PS11) 10Cathal Mooney: Capirca: deal with scenario when netbox 
script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:03] (03PS6) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [11:18:58] (03CR) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [11:19:09] (03PS7) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [11:19:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79808 and previous config saved to /var/cache/conftool/dbconfig/20250724-111919-fceratto.json [11:20:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:20:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79809 and previous config saved to /var/cache/conftool/dbconfig/20250724-112008-marostegui.json [11:20:14] T399249: Add cl_timestamp_id index to 
categorylinks table - https://phabricator.wikimedia.org/T399249 [11:20:25] (03CR) 10Volans: [C:03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [11:27:04] (03CR) 10CI reject: [V:04-1] Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [11:34:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79810 and previous config saved to /var/cache/conftool/dbconfig/20250724-113427-fceratto.json [11:35:28] (03PS12) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:35:36] (03PS13) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:42:58] (03PS1) 10Gmodena: data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437) [11:48:51] (03CR) 10Dr0ptp4kt: [C:03+2] data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [11:49:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79812 and previous config saved to /var/cache/conftool/dbconfig/20250724-114934-fceratto.json [11:49:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:49:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:49:58] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Depooling db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79813 and previous config saved to /var/cache/conftool/dbconfig/20250724-114957-fceratto.json [11:50:01] (03Merged) 10jenkins-bot: data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [11:50:54] (03CR) 10Bartosz Wójtowicz: "LGTM, thank you for this work Kevin! ❤️" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [11:51:05] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [11:58:36] 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344 (10diego) 03NEW [11:59:03] (03PS1) 10Btullis: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1200) [12:00:18] (03CR) 10Volans: [C:03+1] "LGTM, would be nice to add a test for it ;)" [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [12:01:58] (03PS2) 10Btullis: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) [12:02:10] (03CR) 10Cathal Mooney: [C:03+2] JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [12:03:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79814 and previous config saved to /var/cache/conftool/dbconfig/20250724-120319-marostegui.json [12:03:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:04:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79815 and previous config saved to /var/cache/conftool/dbconfig/20250724-120422-fceratto.json [12:04:29] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:07:22] (03CR) 10Btullis: [C:03+2] Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:09:08] (03Merged) 10jenkins-bot: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:18:07] (03Merged) 10jenkins-bot: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [12:18:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P79816 and previous config saved to /var/cache/conftool/dbconfig/20250724-121827-marostegui.json [12:19:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79817 and previous config saved to /var/cache/conftool/dbconfig/20250724-121930-fceratto.json [12:26:42] (03CR) 10Cathal Mooney: [C:03+2] Capirca: deal with scenario when netbox script has never run [software/homer] - 
10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [12:29:28] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit2003.wikimedia.org with reason: maintenance [12:31:06] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on gerrit2002.wikimedia.org with reason: maintenance [12:32:41] 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11030895 (10Vgutierrez) We might need to perform some of this work on HAProxy given it has direct access to the client connection and its properties [12:32:54] (03PS5) 10Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) [12:33:15] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the reviews! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [12:33:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P79818 and previous config saved to /var/cache/conftool/dbconfig/20250724-123334-marostegui.json [12:34:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79819 and previous config saved to /var/cache/conftool/dbconfig/20250724-123437-fceratto.json [12:34:48] (03Merged) 10jenkins-bot: ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [12:35:59] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [12:39:08] (03Merged) 10jenkins-bot: Capirca: deal with scenario when netbox script has never run [software/homer] - 
10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [12:40:42] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:42:30] (03CR) 10Gkyziridis: "Thnx for that smart catch. In this patch there are diffs in the console, although this is only the remove of the `maxReplicas`. Nothing r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis) [12:47:15] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [12:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79820 and previous config saved to /var/cache/conftool/dbconfig/20250724-124842-marostegui.json [12:48:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:48:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79821 and previous config saved to /var/cache/conftool/dbconfig/20250724-124904-marostegui.json [12:49:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79822 and previous config saved to /var/cache/conftool/dbconfig/20250724-124944-fceratto.json [12:49:50] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:49:55] (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313 [12:50:10] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:50:17] (03CR) 10Lucas Werkmeister (WMDE): "Okay… should I just abandon this patch then? (IIUC, the status quo is that mw-on-k8s VMs and most other machines drop core dumps into file" [puppet] - 10https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: 10Lucas Werkmeister (WMDE)) [12:50:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79823 and previous config saved to /var/cache/conftool/dbconfig/20250724-125017-fceratto.json [12:53:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [12:56:16] (03PS10) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) [12:57:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11031015 (10Jclark-ctr) @cmooney @ ayounsi This morning, I updated NetBox with names and locations for all refresh switches and ran two new console cables to SCS. I also v... [12:57:58] (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1300). [13:00:05] danisztls: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] (03CR) 10Stang: zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:00:50] o/ [13:01:05] o/ [13:02:04] I can deploy ^^ [13:04:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [13:04:19] (03Abandoned) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313 (owner: 10Cathal Mooney) [13:04:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79824 and previous config saved to /var/cache/conftool/dbconfig/20250724-130441-fceratto.json [13:04:47] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:05:17] (03Merged) 10jenkins-bot: Deploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [13:05:54] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]] [13:05:59] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [13:06:01] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11031044 (10elukey) @Mvolz I created another set of graphs: https://w.wiki/Eq8o. Note that they are all DC agnostic, since we could/merge eqiad|co... 
[13:08:28] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:09:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:09:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1013.eqiad.wmnet with OS trixie [13:09:57] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:10:08] danisztls: can you test the change? [13:10:11] Lucas_WMDE: since this change just bumps the coverage I already tested what could be tested [13:10:14] ok [13:10:43] yeah hitting mwdebug with 400+ requests just to get the survey is probably not a useful use of server or human time [13:10:53] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Continuing with sync [13:11:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11031062 (10elukey) Reimaged ml-serve1013 with Trixie: ` [13:13:36] Lucas_WMDE: yeah, could write a script for that but it would be pointless [13:14:25] Lucas_WMDE: thanks for deploying [13:15:58] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11031085 (10dcaro) I was able to get the mon working by disabling cephx on the config, and only setting... 
[13:18:01] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]] (duration: 12m 07s) [13:18:08] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [13:19:05] !log UTC afternoon backport+config window done [13:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:17] (03PS1) 10Marostegui: mariadb: Add db224[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1172319 (https://phabricator.wikimedia.org/T400213) [13:19:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79826 and previous config saved to /var/cache/conftool/dbconfig/20250724-131949-fceratto.json [13:22:42] (03CR) 10Marostegui: [C:03+2] mariadb: Add db224[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1172319 (https://phabricator.wikimedia.org/T400213) (owner: 10Marostegui) [13:23:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11031099 (10Marostegui) Patches are done [13:24:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11031100 (10Marostegui) [13:32:16] (03CR) 10Novem Linguae: [C:03+1] "Code looks good. 
Overall, this is very similar to enwiki's setup, with one difference being that they are creating a scrutineer group for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79828 and previous config saved to /var/cache/conftool/dbconfig/20250724-133439-marostegui.json [13:34:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:34:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79829 and previous config saved to /var/cache/conftool/dbconfig/20250724-133456-fceratto.json [13:35:11] (03PS1) 10Dreamy Jazz: Make TSP extensions have warning logs in logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172323 [13:35:17] jouncebot: nowandnext [13:35:17] For the next 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1300) [13:35:17] In 0 hour(s) and 54 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1430) [13:36:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172323 (owner: 10Dreamy Jazz) [13:37:05] (03Merged) 10jenkins-bot: Make TSP extensions have warning logs in logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172323 (owner: 10Dreamy Jazz) [13:37:25] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]] [13:39:38] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:40:57] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [13:41:57] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11031179 (10dcaro) with this, I added a few of the config values back: ` root@cloudcephmon2004-dev:~# ce... [13:43:24] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [13:46:17] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]] (duration: 08m 51s) [13:49:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P79830 and previous config saved to /var/cache/conftool/dbconfig/20250724-134946-marostegui.json [13:50:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79831 and previous config saved to /var/cache/conftool/dbconfig/20250724-135004-fceratto.json [13:50:09] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:50:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [13:50:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79832 and previous config saved to /var/cache/conftool/dbconfig/20250724-135027-fceratto.json [13:51:54] (03CR) 10Xcollazo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [14:04:55] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P79833 and previous config saved to /var/cache/conftool/dbconfig/20250724-140454-marostegui.json [14:05:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79834 and previous config saved to /var/cache/conftool/dbconfig/20250724-140519-fceratto.json [14:05:25] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:09:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:10:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:11:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11031334 (10Jhancock.wm) @Marostegui today or tomorrow is fine. [14:12:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031341 (10Jhancock.wm) @jhathaway i can take care of this connection today. Can you remind me what the hostname of this server is? [14:15:50] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031357 (10Jhancock.wm) @cirrussearch2091 john took a look at it for me and it looks like he was able to get it... 
[14:16:29] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031359 (10Scott_French) [14:18:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031384 (10jhathaway) >>! In T400211#11031335, @Jhancock.wm wrote: > @jhathaway i can take care of this connection today. Can you remind me what the host... [14:20:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79836 and previous config saved to /var/cache/conftool/dbconfig/20250724-142001-marostegui.json [14:20:07] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:20:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79837 and previous config saved to /var/cache/conftool/dbconfig/20250724-142024-marostegui.json [14:20:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P79838 and previous config saved to /var/cache/conftool/dbconfig/20250724-142033-fceratto.json [14:22:41] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031413 (10bking) @Jhancock.wm sure, can y'all try installing Bullseye on it? I only switched it to UEFI because... 
[14:22:51] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031414 (10Jhancock.wm) @bking i rebooted the idrac on this server. It's about all i can do for the moment. is there a time we can depool this server... [14:24:38] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cirrussearch2079.codfw.wmnet with reason: T396718 [14:24:43] T396718: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718 [14:24:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [14:25:01] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031424 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host c... [14:25:48] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031430 (10bking) @Jhancock.wm Sorry we did not respond to this one sooner. The host is downtimed and depooled, feel free to reboot whenever. [14:27:07] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031448 (10Scott_French) No further manual clean-up actions are currently planned, though there will be various spot fixes as teams update their build... 
[14:27:19] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031449 (10Scott_French) 05Open→03Resolved [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1430) [14:33:01] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:35:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P79839 and previous config saved to /var/cache/conftool/dbconfig/20250724-143541-fceratto.json [14:41:10] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [14:44:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [14:44:57] FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:30] (03PS1) 10Federico Ceratto: zarcillo: Enable egress for Alertmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) [14:48:30] (03CR) 10Federico Ceratto: "A simple change to enable egress to Alertmanager" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:50:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79840 and previous config saved to /var/cache/conftool/dbconfig/20250724-145048-fceratto.json [14:50:54] T399728: Add '*_actor_ip_hex_time' 
indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:51:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [14:51:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79841 and previous config saved to /var/cache/conftool/dbconfig/20250724-145112-fceratto.json [14:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170616 (owner: 10Krinkle) [14:55:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170614 (owner: 10Krinkle) [14:56:28] (03Merged) 10jenkins-bot: build: Fix failing `phpcs` in CI on commits updating interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170616 (owner: 10Krinkle) [14:56:36] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170614 (owner: 10Krinkle) [14:56:39] (03CR) 10Clément Goubert: [C:03+1] "LGTM, leaving to o11y if we're ok with reaching out to AM from k8s (may need some firewall rule on AM side?)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:56:58] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]] [14:59:08] !log krinkle@deploy1003 krinkle: Backport for 
[[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:00:05] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1500). [15:00:24] !log krinkle@deploy1003 krinkle: Continuing with sync [15:01:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2091.codfw.wmnet with OS bullseye [15:01:27] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cirru... [15:05:47] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]] (duration: 08m 48s) [15:06:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79842 and previous config saved to /var/cache/conftool/dbconfig/20250724-150605-fceratto.json [15:06:11] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79843 and previous config saved to /var/cache/conftool/dbconfig/20250724-150622-marostegui.json [15:06:28] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 
[15:06:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:06:42] quiddity: Interwiki for mediawiki.org in Parsoid/VisualEditor now defaults to mw: again like before :) [15:09:05] (03CR) 10JHathaway: [C:03+1] redfish: simplify change_user_password for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: 10Elukey) [15:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:45] (03PS2) 10Btullis: dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [15:10:41] (03CR) 10Btullis: [C:03+1] dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [15:11:27] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031731 (10Jhancock.wm) @jhathaway that one took. Still working with Dell on the other server. I'll let you know... 
[15:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:06] !log reprepro include php-xhprof_2.3.10-1+wmf11u1 in component/php81 - T398245 [15:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:12] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245 [15:13:52] !log reprepro include php-xhprof_2.3.10-1+wmf11u1 tideways_5.0.4-16+wmf11u2 in component/php83 - T398245 [15:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P79844 and previous config saved to /var/cache/conftool/dbconfig/20250724-152113-fceratto.json [15:21:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P79845 and previous config saved to /var/cache/conftool/dbconfig/20250724-152129-marostegui.json [15:28:49] (03CR) 10JHathaway: sre.hosts.provision: add custom settings for Supermicro (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [15:31:20] (03CR) 10Elukey: sre.hosts.provision: add custom settings for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) 
[15:33:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031851 (10Jhancock.wm) I'm not sure for the wording and want to clarify. Are you ordering the items needed? Or do we need to start a procurement task? [15:34:50] (03PS6) 10Ilias Sarantopoulos: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis) [15:36:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P79846 and previous config saved to /var/cache/conftool/dbconfig/20250724-153620-fceratto.json [15:36:29] (03CR) 10Ilias Sarantopoulos: "I've updated the patch to use the old schema for now so that we can go ahead and deploy this and we can figure out the new annotations aft" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis) [15:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P79847 and previous config saved to /var/cache/conftool/dbconfig/20250724-153637-marostegui.json [15:37:17] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2079'] [15:37:35] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cirrussearch2079'] [15:39:13] (03CR) 10Ilias Sarantopoulos: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis) [15:44:37] (03PS1) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [15:45:45] (03CR) 10Ilias 
Sarantopoulos: [C:+2] ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis) [15:46:04] (CR) CI reject: [V:-1] haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [15:46:45] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:47:27] (CR) Ozge: [C:+1] ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis) [15:48:09] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:48:22] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:51:06] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2079'] [15:51:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79848 and previous config saved to /var/cache/conftool/dbconfig/20250724-155128-fceratto.json [15:51:33] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:51:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [15:51:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79849 and previous config saved to /var/cache/conftool/dbconfig/20250724-155144-marostegui.json [15:51:51] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:51:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79850 and previous config saved to /var/cache/conftool/dbconfig/20250724-155151-fceratto.json [15:51:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79851 and previous config saved to /var/cache/conftool/dbconfig/20250724-155206-marostegui.json [15:53:28] (03PS1) 10Btullis: Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) [15:55:59] (03CR) 10Btullis: "This will require a corresponding change in deployment-charts to do the following:" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [15:57:21] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11031974 (10fnegri) [15:58:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2079'] [15:58:49] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031980 (10Jhancock.wm) a:03Jhancock.wm updated the idrac firmware manually. then tested a run of firmware update script to see if it would connect.... [15:59:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031986 (10jhathaway) >>! In T400211#11031851, @Jhancock.wm wrote: > I'm not sure for the wording and want to clarify. Are you ordering the items needed?... [16:00:05] jhathaway and moritzm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:48] (03CR) 10JHathaway: sre.hosts.provision: add custom settings for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [16:00:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031990 (10Jhancock.wm) @RobH we can probably discuss this during our meeting this afternoon. Need to order the two items linked in the description. 
[16:04:55] (CR) Gmodena: [C:+1] Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) (owner: Btullis) [16:06:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79852 and previous config saved to /var/cache/conftool/dbconfig/20250724-160643-fceratto.json [16:06:46] !log dancy@deploy1003 Installing scap version "4.190.0" for 2 host(s) [16:06:49] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:07:44] (PS1) Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - https://gerrit.wikimedia.org/r/1172352 [16:08:32] !log dancy@deploy1003 Installation of scap version "4.190.0" completed for 2 hosts [16:21:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P79854 and previous config saved to /var/cache/conftool/dbconfig/20250724-162150-fceratto.json [16:22:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [16:24:15] (CR) Cathal Mooney: [C:+2] CHANGELOG: add changelogs for release v0.10.2 [software/homer] - https://gerrit.wikimedia.org/r/1172352 (owner: Cathal Mooney) [16:27:09] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1024.eqiad.wmnet with OS bookworm [16:32:01] should we worry about that Thumbor alert? [16:32:06] Emperor: ^^?
[16:33:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance [16:34:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036 T399927', diff saved to https://phabricator.wikimedia.org/P79855 and previous config saved to /var/cache/conftool/dbconfig/20250724-163439-root.json [16:34:46] T399927: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927 [16:35:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79856 and previous config saved to /var/cache/conftool/dbconfig/20250724-163553-marostegui.json [16:35:58] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:36:21] looks like there's a bad thumbor pod [16:36:22] looking [16:36:54] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172352 (owner: 10Cathal Mooney) [16:36:57] vgutierrez: FTR, I'm not a thumbor expert (but it looks like one has appeared :) ) [16:36:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P79857 and previous config saved to /var/cache/conftool/dbconfig/20250724-163658-fceratto.json [16:37:12] hnowlan: thx <3 [16:38:10] Krinkle: Thank you for the fix and for letting me know! [16:39:24] pod cordoned, error rate will hopefully dip a little - that alert is a little misleading as it's a high error rate for a single pod rather than the fleet aiui [16:39:42] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032190 (10jhathaway) @KFrancis would you kindly confirm that @Novem_Linguae has signed the NDA? 
[16:40:29] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032194 (10Marostegui) >>! In T399927#11031334, @Jhancock.wm wrote: > @Marostegui today or tomorrow is fine. es2036 is ready for you [16:41:02] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032199 (10Novem_Linguae) I haven't signed it yet. I'm happy to do so. Just need instructions on how to get that started. [16:42:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [16:43:04] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1024.eqiad.wmnet with reason: host reimage [16:44:58] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11032207 (10jhathaway) [16:48:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1024.eqiad.wmnet with reason: host reimage [16:49:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11032225 (10RobH) I'm a bit surprised the server doesn't have the db connection cable, but we can order that from SM store or Rich@SM via quotation. Then... [16:50:47] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11032232 (10jhathaway) @Miriam would you kindly approve @diego's request to be added to analytics-research-admins? 
[16:51:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P79858 and previous config saved to /var/cache/conftool/dbconfig/20250724-165100-marostegui.json [16:52:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79859 and previous config saved to /var/cache/conftool/dbconfig/20250724-165205-fceratto.json [16:52:10] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:52:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance [16:52:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79860 and previous config saved to /var/cache/conftool/dbconfig/20250724-165228-fceratto.json [16:53:30] !log delete thumbor pod where all instances displayed signs of T374350 [16:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:35] T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error - https://phabricator.wikimedia.org/T374350 [16:55:39] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) [16:56:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy) [16:56:28] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83096MiB (3% 
inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1700) [17:00:13] 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11032270 (10jhathaway) @HCoplin-WMF happy to help grant you access. You may be able to request access through our new IDM tool, https://idm.wikimedia.org. Can you try logging in and requ... [17:03:25] looks like developer portal should release but also like the base image there needs some attention to fix the container builds. 
I'll start poking things [17:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P79862 and previous config saved to /var/cache/conftool/dbconfig/20250724-170608-marostegui.json [17:07:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79863 and previous config saved to /var/cache/conftool/dbconfig/20250724-170719-fceratto.json [17:07:25] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:14:10] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:16:54] SRE-SLO, Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11032344 (herron) Trying today with an ad-hoc prometheus instance to compact the overlapping blocks before uploading Generate the backfill blocks again ` /tmp/backfill/tonecheck$ time p...
[17:16:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:10] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2036 [17:17:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2036 [17:21:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79864 and previous config saved to /var/cache/conftool/dbconfig/20250724-172117-marostegui.json [17:21:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:21:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:21:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79865 and previous config saved to /var/cache/conftool/dbconfig/20250724-172140-marostegui.json [17:22:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P79866 and previous config saved to /var/cache/conftool/dbconfig/20250724-172227-fceratto.json [17:23:36] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:26:54] (CR) Cwhite: [C:+2] opensearch: curator instance config to follow $enable_curator [puppet] - https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) (owner: Cwhite) [17:27:20] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032375 (Marostegui) es2036 done ` [ 21.582858] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: none [ 2...
[17:27:33] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032376 (Marostegui) [17:28:24] (PS1) Volans: insetup role report: update recipients [puppet] - https://gerrit.wikimedia.org/r/1172365 [17:34:21] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1024.eqiad.wmnet with OS bookworm [17:37:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P79867 and previous config saved to /var/cache/conftool/dbconfig/20250724-173734-fceratto.json [17:38:30] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1025.eqiad.wmnet with OS bookworm [17:42:40] SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032404 (KFrancis) Hi @Novem_Linguae, please send your legal name, postal address, and email to kfrancis@wikimedia.org and I will put the NDA together for you to sign. Thanks!
[17:47:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79868 and previous config saved to /var/cache/conftool/dbconfig/20250724-174752-root.json [17:47:55] (03CR) 10Volans: "Updated reviewers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [17:48:09] (03CR) 10Volans: "Updated reviewers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [17:50:39] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway) [17:52:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79869 and previous config saved to /var/cache/conftool/dbconfig/20250724-175242-fceratto.json [17:52:48] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:52:49] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1025.eqiad.wmnet with reason: host reimage [17:52:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [17:57:03] (03CR) 10Volans: "@fceratto@wikimedia.org" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans) [17:57:33] (03CR) 10Volans: "[update]" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans) [17:58:57] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1025.eqiad.wmnet with reason: host reimage [18:00:04] dduvall and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1800). [18:02:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79870 and previous config saved to /var/cache/conftool/dbconfig/20250724-180258-root.json [18:03:51] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372) [18:03:53] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:04:47] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:06:19] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:07:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79871 and previous config saved to /var/cache/conftool/dbconfig/20250724-180758-marostegui.json [18:08:04] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:12:40] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.11 refs T396372 [18:12:45] T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372 [18:13:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:15:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [18:16:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install 
clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm [18:18:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79872 and previous config saved to /var/cache/conftool/dbconfig/20250724-181803-root.json [18:22:31] train is clear [18:23:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P79873 and previous config saved to /var/cache/conftool/dbconfig/20250724-182306-marostegui.json [18:29:44] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1025.eqiad.wmnet with OS bookworm [18:32:13] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402 (10RobH) 03NEW [18:33:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79874 and previous config saved to /var/cache/conftool/dbconfig/20250724-183309-root.json [18:33:16] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11032563 (10RobH) [18:38:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P79875 and previous config saved to /var/cache/conftool/dbconfig/20250724-183813-marostegui.json [18:45:47] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1030.eqiad.wmnet with OS bookworm [18:48:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79876 and previous config saved to /var/cache/conftool/dbconfig/20250724-184815-root.json 
[18:51:32] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11032595 (10jhathaway) @ttaylor when you have a moment, please review and sign the L3 server access document. [18:51:47] vriley@cumin1002 reimage (PID 2912686) is awaiting input [18:52:03] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [18:52:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with... [18:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79877 and previous config saved to /var/cache/conftool/dbconfig/20250724-185320-marostegui.json [18:53:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:53:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:53:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79878 and previous config saved to /var/cache/conftool/dbconfig/20250724-185343-marostegui.json [18:54:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [18:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team 
(Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm [19:03:12] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1030.eqiad.wmnet with reason: host reimage [19:06:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405 (10SD0001) 03NEW [19:08:39] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1030.eqiad.wmnet with reason: host reimage [19:08:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11032662 (10SD0001) [19:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:22] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11032667 (10ttaylor) @jhathaway done! 
[19:19:57] FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:27:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11032702 (10MusikAnimal) Confirming my sponsorship as WMF staff. SD0001 has done fantastic work in improving the performance of DB queries that can't be tested by other means su... [19:29:42] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:33:36] 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032715 (10Scott_French) [19:34:42] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:42:20] vriley@cumin1002 reimage (PID 2954117) is awaiting input [19:42:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [19:42:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with... 
[19:44:09] (03PS1) 10BCornwall: Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 [19:44:47] (03PS2) 10BCornwall: Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367) [19:45:02] 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032737 (10Scott_French) [19:45:09] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1030.eqiad.wmnet with OS bookworm [19:47:13] (03PS1) 10Dzahn: add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [19:47:29] (03CR) 10CI reject: [V:04-1] add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [19:47:47] (03PS1) 10BCornwall: ncredir: Redirect wikipedialibrary.org [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) [19:48:39] (03PS1) 10BCornwall: acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) [19:49:36] 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032749 (10Scott_French) [19:51:13] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6416/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [19:55:29] (03CR) 10Dzahn: [C:03+1] Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [19:56:10] (03CR) 10Dzahn: [C:03+1] acme-chief: Add wikipedialibrary.org to certs 
[puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [19:57:40] (03CR) 10BCornwall: [C:03+2] Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [19:57:46] (03CR) 10BCornwall: [V:03+1 C:03+2] acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [19:58:59] !log brett@dns1004 START - running authdns-update [19:59:19] (03CR) 10Dzahn: [V:04-1] "#15 ERROR: "/go.sum" not found: not found" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [19:59:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79879 and previous config saved to /var/cache/conftool/dbconfig/20250724-195958-marostegui.json [19:59:59] !log brett@dns1004 END - running authdns-update [20:00:03] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2000). [20:00:04] Daimona: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:23] o/ [20:00:25] (03CR) 10Dzahn: [V:04-1] "upstream code I am looking at for this is line 25 - 32 in https://github.com/sourcebot-dev/sourcebot/blob/main/Dockerfile" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:00:53] I can deploy [20:01:22] (03CR) 10Zabe: [C:03+2] Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy) [20:01:34] Thank you! [20:02:14] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy) [20:02:39] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]] [20:02:44] T397369: Enable CampaignEvents Extension on wikimania - https://phabricator.wikimedia.org/T397369 [20:03:16] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1031.eqiad.wmnet with OS bookworm [20:04:36] !log zabe@deploy1003 zabe, daimona: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:04:46] Daimona: can you test? [20:06:41] Looks good aside from the usual caching issues like unavailable RL modules [20:06:59] !log zabe@deploy1003 zabe, daimona: Continuing with sync [20:12:26] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]] (duration: 09m 46s) [20:12:31] T397369: Enable CampaignEvents Extension on wikimania - https://phabricator.wikimedia.org/T397369 [20:12:35] Daimona: should be live [20:13:24] Can confirm, thank you! 
[20:13:32] yw [20:13:33] (03CR) 10Fabfur: [C:03+1] ncredir: Redirect wikipedialibrary.org [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [20:14:08] (03PS1) 10DLynch: Enable DiscussionTools thanks on existing "report incident" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172397 [20:15:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P79880 and previous config saved to /var/cache/conftool/dbconfig/20250724-201506-marostegui.json [20:16:27] (03CR) 10Dzahn: "wmflabs.org is deprecated and wmcloud.org should be used instead. let's see if they published this somewhere already or can easily change " [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [20:18:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1031.eqiad.wmnet with reason: host reimage [20:23:21] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1031.eqiad.wmnet with reason: host reimage [20:24:30] (03CR) 10BCornwall: [V:03+1] acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall) [20:26:39] (03PS1) 10CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401 [20:26:39] (03PS1) 10CDanis: varnish: include wmfuniq count in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1172402 [20:28:09] (03PS2) 10DLynch: Enable DiscussionTools thanks on existing "report incident" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172397 (https://phabricator.wikimedia.org/T366095) [20:28:28] (03PS2) 10CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401 [20:28:28] 
(03PS2) 10CDanis: varnish: include wmfuniq count in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1172402 [20:28:32] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172401 (owner: 10CDanis) [20:28:36] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172402 (owner: 10CDanis) [20:30:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P79881 and previous config saved to /var/cache/conftool/dbconfig/20250724-203013-marostegui.json [20:45:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79882 and previous config saved to /var/cache/conftool/dbconfig/20250724-204521-marostegui.json [20:45:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:45:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1253.eqiad.wmnet with reason: Maintenance [20:45:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79883 and previous config saved to /var/cache/conftool/dbconfig/20250724-204543-marostegui.json [20:50:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [20:50:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1031.eqiad.wmnet with OS bookworm [20:52:29] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412 (10RobH) 03NEW [20:53:17] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - 
https://phabricator.wikimedia.org/T400412#11032928 (10RobH) Please note Jaime already merged the puppet changes needed for this host, so not assigning to them for that just leaving in the racking task column on #ops-eqiad for when... [21:00:02] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1032.eqiad.wmnet with OS bookworm [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2100) [21:12:52] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [21:13:41] (03Merged) 10jenkins-bot: Set categorylinks to read new on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [21:14:08] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]] [21:14:12] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [21:16:06] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:17:52] !log zabe@deploy1003 zabe: Continuing with sync [21:18:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1032.eqiad.wmnet with reason: host reimage [21:23:19] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1032.eqiad.wmnet with reason: host reimage [21:23:36] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]] (duration: 09m 28s) [21:23:40] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [21:25:07] 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11033037 (10HCoplin-WMF) I was indeed able to log in and request it that way! Thank you :) Apologies for the double request here, if that's y'all's preferred method. You might want to... [21:29:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79884 and previous config saved to /var/cache/conftool/dbconfig/20250724-212916-marostegui.json [21:29:21] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:44:08] jouncebot: nowandnext [21:44:08] For the next 0 hour(s) and 15 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2100) [21:44:08] In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T0600) [21:44:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P79885 and previous config saved to /var/cache/conftool/dbconfig/20250724-214423-marostegui.json [21:51:21] !log dreamyjazz Deployed security patch for T399093 [21:59:32] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P79886 and previous config saved to /var/cache/conftool/dbconfig/20250724-215931-marostegui.json [22:00:28] !log dancy@deploy1003 Installing scap version "4.191.0" for 2 host(s) [22:02:18] !log dancy@deploy1003 Installation of scap version "4.191.0" completed for 2 hosts [22:04:51] !log dreamyjazz@deploy1003 Started scap sync-world: (no justification provided) [22:07:56] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:08:23] !log dreamyjazz@deploy1003 Finished scap sync-world: (no justification provided) (duration: 03m 36s) [22:09:28] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11033187 (10nisrael) Just confirming for visibility here, I was able to gain access! [22:12:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79887 and previous config saved to /var/cache/conftool/dbconfig/20250724-221439-marostegui.json [22:14:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:14:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [22:15:15] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [22:15:26] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1032.eqiad.wmnet with OS bookworm [22:15:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install 
clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm [22:21:53] (03PS1) 10Dreamy Jazz: Make SecurePoll channel log warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172422 [22:22:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172422 (owner: 10Dreamy Jazz) [22:22:35] Security deploy failed to apply properly. Using a scap backport to reset the patches back to before. [22:23:04] (03Merged) 10jenkins-bot: Make SecurePoll channel log warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172422 (owner: 10Dreamy Jazz) [22:23:16] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]] [22:25:16] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:26:05] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [22:27:44] 06SRE, 10WMF-General-or-Unknown: Outdated link in message that is sent for forbidden connections - https://phabricator.wikimedia.org/T400421#11033218 (10Pppery) [22:31:19] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]] (duration: 08m 03s) [22:36:03] (03CR) 10Clare Ming: [C:03+1] "just fyi, i think web team is running an a/a test on testwiki so they may be relying on these configs for data collection -- depending on " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [22:42:03] !log dreamyjazz Deployed security patch for T399093 [22:44:21] (03CR) 10ZhaoFJx: [C:03+1] zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [22:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:03:29] vriley@cumin1002 reimage (PID 3112741) is awaiting input [23:04:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [23:04:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033279 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with... 
[23:04:38] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:06:54] 06SRE, 06serviceops, 10Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425 (10Scott_French) 03NEW [23:09:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [23:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:47] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [23:11:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm [23:12:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:12:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:13:24] 06SRE, 06serviceops, 10Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033296 (10Scott_French) [23:13:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:14:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:27:54] vriley@cumin1002 reimage (PID 3119536) is awaiting input [23:34:27] 06SRE, 06Traffic: Outdated link in message that is sent for forbidden connections - https://phabricator.wikimedia.org/T400421#11033311 (10bd808) `lang=shell-session $ git grep https://meta.wikimedia.org/wiki/User-Agent_policy modules/profile/files/wmcs/services/maintain_dbusers/maintain_dbusers.py: # https:... 
[23:37:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172433 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172433 (owner: 10TrainBranchBot) [23:42:34] (03PS1) 10BryanDavis: wmcs: Update URL in comment in maintain_dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421) [23:42:36] (03PS1) 10BryanDavis: varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) [23:44:49] (03PS2) 10BryanDavis: varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) [23:50:40] 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11033332 (10bd808) [23:51:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172433 (owner: 10TrainBranchBot) [23:52:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown