[00:08:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126
[00:08:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126 (owner: TrainBranchBot)
[00:27:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172126 (owner: TrainBranchBot)
[00:34:09] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:44:09] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:41] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/e4ff9c7720fba1048fff41692f396062e64759d90d363518465c4b72e4daaf8a/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:59:09] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:01:51] <logmsgbot>	 !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - ryankemper@cumin1002 - T397227
[01:01:56] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[01:06:41] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:14:09] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-psi-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:28:11] <ryankemper>	 !log [Cirrus] `ryankemper@cirrussearch2071:~$ sudo systemctl restart opensearch-disable-readahead-production-search-psi-codfw.service`
[01:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:19] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:35:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:56:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:59:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:59:11] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:00:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:01:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:02:19] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:03:21] <icinga-wm>	 PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 148564 MB (3% inode=99%): /var/lib/hadoop/data/g 147180 MB (3% inode=99%): /var/lib/hadoop/data/j 147069 MB (3% inode=99%): /var/lib/hadoop/data/c 141308 MB (3% inode=99%): /var/lib/hadoop/data/b 146487 MB (3% inode=99%): /var/lib/hadoop/data/l 146676 MB (3% inode=99%): /var/lib/hadoop/data/k 146423 MB (3% inode=99%): /var/lib/hadoop/data
[04:03:21] <icinga-wm>	 1 MB (4% inode=99%): /var/lib/hadoop/data/i 144202 MB (3% inode=99%): /var/lib/hadoop/data/m 153542 MB (4% inode=99%): /var/lib/hadoop/data/d 153568 MB (4% inode=99%): /var/lib/hadoop/data/h 149908 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[05:28:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:32:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79785 and previous config saved to /var/cache/conftool/dbconfig/20250724-053236-root.json
[05:47:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79786 and previous config saved to /var/cache/conftool/dbconfig/20250724-054743-root.json
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0600).
[06:02:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79787 and previous config saved to /var/cache/conftool/dbconfig/20250724-060249-root.json
[06:05:51] <wikibugs>	 (PS1) Marostegui: mariadb: Productionze es1049-es1057 [puppet] - https://gerrit.wikimedia.org/r/1172182 (https://phabricator.wikimedia.org/T400198)
[06:17:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79788 and previous config saved to /var/cache/conftool/dbconfig/20250724-061755-root.json
[06:33:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79789 and previous config saved to /var/cache/conftool/dbconfig/20250724-063300-root.json
[06:47:44] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionze es1049-es1057 [puppet] - https://gerrit.wikimedia.org/r/1172182 (https://phabricator.wikimedia.org/T400198) (owner: Marostegui)
[06:50:24] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11029953 (Marostegui)
[06:50:33] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11029954 (Marostegui) Patches done
[06:51:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:52:15] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79790 and previous config saved to /var/cache/conftool/dbconfig/20250724-065222-marostegui.json
[06:52:28] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:57:51] <wikibugs>	 (PS1) Marostegui: mariadb: Productionze es2049-es2057 [puppet] - https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:27] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionze es2049-es2057 [puppet] - https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195) (owner: Marostegui)
[07:01:13] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11029965 (Marostegui) Patches done
[07:02:00] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11029966 (Marostegui)
[07:03:21] <icinga-wm>	 PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 153521 MB (4% inode=99%): /var/lib/hadoop/data/g 156826 MB (4% inode=99%): /var/lib/hadoop/data/j 156367 MB (4% inode=99%): /var/lib/hadoop/data/c 149979 MB (3% inode=99%): /var/lib/hadoop/data/b 155594 MB (4% inode=99%): /var/lib/hadoop/data/l 160933 MB (4% inode=99%): /var/lib/hadoop/data/k 157103 MB (4% inode=99%): /var/lib/hadoop/data
[07:03:21] <icinga-wm>	 1 MB (4% inode=99%): /var/lib/hadoop/data/i 157691 MB (4% inode=99%): /var/lib/hadoop/data/m 155571 MB (4% inode=99%): /var/lib/hadoop/data/d 159581 MB (4% inode=99%): /var/lib/hadoop/data/h 159658 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[07:07:23] <wikibugs>	 (PS1) Marostegui: db1227: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172195 (https://phabricator.wikimedia.org/T399955)
[07:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:36] <wikibugs>	 (CR) Marostegui: [C:+2] db1227: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172195 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[07:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[07:13:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1227 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79791 and previous config saved to /var/cache/conftool/dbconfig/20250724-071300-marostegui.json
[07:20:13] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11029993 (Marostegui) >>! In T399927#11028165, @Jhancock.wm wrote: > @Marostegui lemme know when you want to do es2036  I can have it ready today if'd like
[07:20:59] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11029994 (Marostegui)
[07:21:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79792 and previous config saved to /var/cache/conftool/dbconfig/20250724-072100-root.json
[07:36:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79793 and previous config saved to /var/cache/conftool/dbconfig/20250724-073606-root.json
[07:36:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79794 and previous config saved to /var/cache/conftool/dbconfig/20250724-073628-marostegui.json
[07:36:34] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:41:25] <wikibugs>	 (CR) Vgutierrez: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[07:51:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79795 and previous config saved to /var/cache/conftool/dbconfig/20250724-075112-root.json
[07:51:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P79796 and previous config saved to /var/cache/conftool/dbconfig/20250724-075135-marostegui.json
[07:54:52] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1171147 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[07:55:35] <wikibugs>	 (CR) Ilias Sarantopoulos: "It seems that something is not working right. If you check Jenkins there seems to be no diffs found https://integration.wikimedia.org/ci/j" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[08:03:06] <wikibugs>	 (PS8) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:04:41] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:06:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79797 and previous config saved to /var/cache/conftool/dbconfig/20250724-080617-root.json
[08:06:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P79798 and previous config saved to /var/cache/conftool/dbconfig/20250724-080643-marostegui.json
[08:10:04] <wikibugs>	 (Abandoned) Bartosz Wójtowicz: ml-services: Update image version for revertrisk models on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1171147 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:14:24] <wikibugs>	 (CR) Elukey: "Thanks for the review Jesse!" [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[08:15:42] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11030139 (FCeratto-WMF) I opened a puppet CR with the following setup: ` db-test1000  eqiad primary master db-test1001 db-test1002 db-test2001 codfw dc-mas...
[08:16:28] <wikibugs>	 (PS9) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:17:37] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:20:32] <wikibugs>	 (CR) Volans: "replied inline" [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[08:21:06] <wikibugs>	 (CR) Stevemunene: [C:+2] zookeeper: Add an-druid100[45] to the cluster [puppet] - https://gerrit.wikimedia.org/r/1171208 (https://phabricator.wikimedia.org/T397440) (owner: Stevemunene)
[08:21:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399249)', diff saved to https://phabricator.wikimedia.org/P79799 and previous config saved to /var/cache/conftool/dbconfig/20250724-082150-marostegui.json
[08:21:56] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:22:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:22:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79800 and previous config saved to /var/cache/conftool/dbconfig/20250724-082213-marostegui.json
[08:22:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11030150 (elukey)
[08:23:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11030153 (elukey)
[08:23:50] <wikibugs>	 (PS10) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:25:43] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:28:41] <wikibugs>	 (PS11) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:30:07] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:33:15] <wikibugs>	 SRE-SLO, Observability-Metrics, Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11030188 (fgiunchedi) In that case yes it seems an ad-hoc prometheus instance to run compaction on blocks might be viable, cfr https://github....
[08:36:13] <wikibugs>	 (CR) Fabfur: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:40:50] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11030247 (elukey) Created https://wikitech.wikimedia.org/wiki/Maps/v2/Common_tasks#Warm_up_the_Tegola_tiles_cache_from_scratch
[08:42:50] <wikibugs>	 (PS12) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[08:44:00] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[08:56:04] <wikibugs>	 (CR) Ilias Sarantopoulos: "for the rps metric it could be because the previous annotations are there. You could try the following 2 options:" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[09:03:33] <wikibugs>	 (PS3) Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259)
[09:04:51] <wikibugs>	 (CR) Vgutierrez: "Done" [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:05:11] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:08:13] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:08:41] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:09:41] <icinga-wm>	 PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:10:04] <wikibugs>	 (PS1) Effie Mouzeli: prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264
[09:10:31] <wikibugs>	 (CR) CI reject: [V:-1] prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264 (owner: Effie Mouzeli)
[09:11:13] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:31] <icinga-wm>	 RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:11:37] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:12:35] <wikibugs>	 (PS2) Effie Mouzeli: prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264
[09:12:52] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs1013.eqiad.wmnet} and A:liberica (T400259)
[09:12:57] <stashbot>	 T400259: Stop using lvs1013 as a liberica canary - https://phabricator.wikimedia.org/T400259
[09:13:12] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs1013.eqiad.wmnet} and A:liberica (T400259)
[09:15:55] <wikibugs>	 (CR) Vgutierrez: [C:+2] site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: Vgutierrez)
[09:18:05] <wikibugs>	 (PS1) Elukey: redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365)
[09:19:21] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] prometheus::ops: fix hcaptcha query [puppet] - https://gerrit.wikimedia.org/r/1172264 (owner: Effie Mouzeli)
[09:22:26] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[09:25:40] <wikibugs>	 (PS13) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:26:55] <wikibugs>	 (CR) CI reject: [V:-1] redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: Elukey)
[09:27:07] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:29:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:30:15] <vgutierrez>	 that's expected, no more liberica instances in eqiad at the moment
[09:31:46] <wikibugs>	 (CR) Effie Mouzeli: "Thank you for this Lucas! We had attempted in the past to enable coredumps properly, and run into issues such as servers becoming unrespon" [puppet] - https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: Lucas Werkmeister (WMDE))
[09:31:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79801 and previous config saved to /var/cache/conftool/dbconfig/20250724-093158-marostegui.json
[09:32:04] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:34:09] <wikibugs>	 Puppet, SRE, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11030334 (jijiki) (cp from gerrit comment)  We had attempted in the past to enabl...
[09:34:28] <wikibugs>	 (CR) Clément Goubert: "Yeah, the linked task is abandoned as well." [deployment-charts] - https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: Clément Goubert)
[09:34:40] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11030335 (jijiki) p:Triage→Low
[09:34:40] <wikibugs>	 (Abandoned) Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: Clément Goubert)
[09:36:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[09:36:58] <wikibugs>	 (PS14) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:37:08] <vgutierrez>	 !log disable BGP for lvs1013 on lsw1-e1-eqiad.mgmt.eqiad.wmnet - T400259
[09:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:13] <stashbot>	 T400259: Stop using lvs1013 as a liberica canary - https://phabricator.wikimedia.org/T400259
[09:38:08] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:39:12] <wikibugs>	 (PS15) Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[09:40:03] <wikibugs>	 SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11030353 (Tgr) See T392633#10776362 for a full list of session tokens. We plan to treat everything other than OAuth 2 and session cookies as anonymous for rate limiting purposes, so I imagine you don't care about validat...
[09:41:28] <wikibugs>	 (CR) Fabfur: haproxykafka: fixed missing site in dashboard link (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[09:42:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[09:44:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:47:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P79803 and previous config saved to /var/cache/conftool/dbconfig/20250724-094706-marostegui.json
[09:53:14] <hnowlan>	 jouncebot: nowandnext
[09:53:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[09:53:15] <jouncebot>	 In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1000)
[09:54:00] <wikibugs>	 (03CR) 10Vgutierrez: haproxykafka: fixed missing site in dashboard link (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[09:54:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:55:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:56:25] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] wmnet: Remove maintenance.eqiad.wmnet record [dns] - 10https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert)
[09:57:14] <logmsgbot>	 !log cgoubert@dns1004 START - running authdns-update
[09:57:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bookworm
[09:57:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[09:57:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[09:58:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[09:58:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[09:58:13] <logmsgbot>	 !log cgoubert@dns1004 END - running authdns-update
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1000)
[10:01:36] <wikibugs>	 (03PS16) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[10:01:54] <wikibugs>	 (03CR) 10Fabfur: haproxykafka: fixed missing site in dashboard link (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[10:02:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P79804 and previous config saved to /var/cache/conftool/dbconfig/20250724-100213-marostegui.json
[10:05:11] <wikibugs>	 (03CR) 10AikoChou: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[10:06:21] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334 (10dcaro) 03NEW
[10:09:00] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030433 (10dcaro)
[10:12:20] <wikibugs>	 (03PS2) 10Elukey: redfish: simplify change_user_password for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365)
[10:14:27] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172273
[10:17:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399249)', diff saved to https://phabricator.wikimedia.org/P79805 and previous config saved to /var/cache/conftool/dbconfig/20250724-101721-marostegui.json
[10:17:27] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:17:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:19:52] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030440 (10dcaro)
[10:24:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter2005:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[10:24:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:31:21] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030444 (10dcaro) While doing this upgrade, the mons crashed; the error they showed was about using an old mon...
[10:34:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2005:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[10:34:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:37:18] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, the code could potentially benefit from some refactoring at this point, not a blocker." [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[10:41:19] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[10:43:08] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030459 (10dcaro) p:05Triage→03High
[10:44:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:46:40] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[10:47:43] <wikibugs>	 (03CR) 10Volans: Capirca: handle script having no 'status' attribute gracefully (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[10:49:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:49:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79806 and previous config saved to /var/cache/conftool/dbconfig/20250724-104938-fceratto.json
[10:49:43] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[10:52:56] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[10:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:58:00] <wikibugs>	 (03PS1) 10Phuedx: MetricsPlatform: Disable synchronous configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422)
[10:58:09] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11030499 (10dcaro) The client on 2004 keeps getting connection refused: ` 148677 connect(12, {sa_family=...
[11:04:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79807 and previous config saved to /var/cache/conftool/dbconfig/20250724-110412-fceratto.json
[11:04:17] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[11:06:46] <wikibugs>	 (03PS8) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:07:52] <wikibugs>	 (03CR) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[11:08:22] <wikibugs>	 (03PS9) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:08:56] <wikibugs>	 (03PS10) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:09:03] <wikibugs>	 (03PS11) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:03] <wikibugs>	 (03PS6) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[11:18:58] <wikibugs>	 (03CR) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[11:19:09] <wikibugs>	 (03PS7) 10Cathal Mooney: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[11:19:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79808 and previous config saved to /var/cache/conftool/dbconfig/20250724-111919-fceratto.json
[11:20:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:20:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79809 and previous config saved to /var/cache/conftool/dbconfig/20250724-112008-marostegui.json
[11:20:14] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[11:20:25] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[11:27:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[11:34:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79810 and previous config saved to /var/cache/conftool/dbconfig/20250724-113427-fceratto.json
[11:35:28] <wikibugs>	 (03PS12) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:35:36] <wikibugs>	 (03PS13) 10Cathal Mooney: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:42:58] <wikibugs>	 (03PS1) 10Gmodena: data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437)
[11:48:51] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+2] data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[11:49:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399728)', diff saved to https://phabricator.wikimedia.org/P79812 and previous config saved to /var/cache/conftool/dbconfig/20250724-114934-fceratto.json
[11:49:40] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[11:49:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:49:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79813 and previous config saved to /var/cache/conftool/dbconfig/20250724-114957-fceratto.json
[11:50:01] <wikibugs>	 (03Merged) 10jenkins-bot: data-engineering: eventbus: increase anomaly detection threshold [alerts] - 10https://gerrit.wikimedia.org/r/1172280 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[11:50:54] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: "LGTM, thank you for this work Kevin! ❤️" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[11:51:05] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[11:58:36] <wikibugs>	 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344 (10diego) 03NEW
[11:59:03] <wikibugs>	 (03PS1) 10Btullis: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1200)
[12:00:18] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, would be nice to add a test for it ;)" [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[12:01:58] <wikibugs>	 (03PS2) 10Btullis: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389)
[12:02:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[12:03:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79814 and previous config saved to /var/cache/conftool/dbconfig/20250724-120319-marostegui.json
[12:03:26] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[12:04:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79815 and previous config saved to /var/cache/conftool/dbconfig/20250724-120422-fceratto.json
[12:04:29] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[12:07:22] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[12:09:08] <wikibugs>	 (03Merged) 10jenkins-bot: Dumps: bump the mediawiki image deployed to the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172285 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[12:18:07] <wikibugs>	 (03Merged) 10jenkins-bot: JunOS: pass ignore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[12:18:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P79816 and previous config saved to /var/cache/conftool/dbconfig/20250724-121827-marostegui.json
[12:19:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79817 and previous config saved to /var/cache/conftool/dbconfig/20250724-121930-fceratto.json
[12:26:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[12:29:28] <logmsgbot>	 !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit2003.wikimedia.org with reason: maintenance
[12:31:06] <logmsgbot>	 !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on gerrit2002.wikimedia.org with reason: maintenance
[12:32:41] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11030895 (10Vgutierrez) We might need to perform some of this work on HAProxy given it has direct access to the client connection and its properties
[12:32:54] <wikibugs>	 (03PS5) 10Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[12:33:15] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the reviews! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[12:33:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P79818 and previous config saved to /var/cache/conftool/dbconfig/20250724-123334-marostegui.json
[12:34:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79819 and previous config saved to /var/cache/conftool/dbconfig/20250724-123437-fceratto.json
[12:34:48] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[12:35:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie
[12:39:08] <wikibugs>	 (03Merged) 10jenkins-bot: Capirca: deal with scenario when netbox script has never run [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[12:40:42] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:42:30] <wikibugs>	 (03CR) 10Gkyziridis: "Thanks for that smart catch. In this patch there are diffs in the console, although this is only the removal of the `maxReplicas`. Nothing r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis)
[12:47:15] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage
[12:48:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T399249)', diff saved to https://phabricator.wikimedia.org/P79820 and previous config saved to /var/cache/conftool/dbconfig/20250724-124842-marostegui.json
[12:48:48] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[12:48:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[12:49:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79821 and previous config saved to /var/cache/conftool/dbconfig/20250724-124904-marostegui.json
[12:49:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399728)', diff saved to https://phabricator.wikimedia.org/P79822 and previous config saved to /var/cache/conftool/dbconfig/20250724-124944-fceratto.json
[12:49:50] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[12:49:55] <wikibugs>	 (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313
[12:50:10] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:50:17] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Okay… should I just abandon this patch then? (IIUC, the status quo is that mw-on-k8s VMs and most other machines drop core dumps into file" [puppet] - 10https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: 10Lucas Werkmeister (WMDE))
[12:50:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79823 and previous config saved to /var/cache/conftool/dbconfig/20250724-125017-fceratto.json
[12:53:20] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage
[12:56:16] <wikibugs>	 (03PS10) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[12:57:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11031015 (10Jclark-ctr) @cmooney @ayounsi This morning, I updated NetBox with names and locations for all refresh switches and ran two new console cables to SCS. I also v...
[12:57:58] <wikibugs>	 (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1300).
[13:00:05] <jouncebot>	 danisztls: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <wikibugs>	 (03CR) 10Stang: zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[13:00:50] <danisztls>	 o/
[13:01:05] <Lucas_WMDE>	 o/
[13:02:04] <Lucas_WMDE>	 I can deploy ^^
[13:04:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza)
[13:04:19] <wikibugs>	 (03Abandoned) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1172313 (owner: 10Cathal Mooney)
[13:04:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79824 and previous config saved to /var/cache/conftool/dbconfig/20250724-130441-fceratto.json
[13:04:47] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[13:05:17] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza)
[13:05:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]]
[13:05:59] <stashbot>	 T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736
[13:06:01] <wikibugs>	 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11031044 (10elukey) @Mvolz I created another set of graphs: https://w.wiki/Eq8o. Note that they are all DC agnostic, since we could/merge eqiad|co...
[13:08:28] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[13:09:29] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[13:09:29] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1013.eqiad.wmnet with OS trixie
[13:09:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:10:08] <Lucas_WMDE>	 danisztls: can you test the change?
[13:10:11] <danisztls>	 Lucas_WMDE: since this change just bumps the coverage I already tested what could be tested
[13:10:14] <Lucas_WMDE>	 ok
[13:10:43] <Lucas_WMDE>	 yeah hitting mwdebug with 400+ requests just to get the survey is probably not a useful use of server or human time
[13:10:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Continuing with sync
[13:11:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11031062 (10elukey) Reimaged ml-serve1013 with Trixie:  `
[13:13:36] <danisztls>	 Lucas_WMDE: yeah, could write a script for that but it would be pointless
[13:14:25] <danisztls>	 Lucas_WMDE: thanks for deploying
[13:15:58] <wikibugs>	 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11031085 (10dcaro) I was able to get the mon working by disabling cephx on the config, and only setting...
[13:18:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170760|Deploy Readers Use Cases Survey v2 (T399736)]] (duration: 12m 07s)
[13:18:08] <stashbot>	 T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736
[13:19:05] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:17] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Add db224[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1172319 (https://phabricator.wikimedia.org/T400213)
[13:19:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79826 and previous config saved to /var/cache/conftool/dbconfig/20250724-131949-fceratto.json
[13:22:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Add db224[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1172319 (https://phabricator.wikimedia.org/T400213) (owner: 10Marostegui)
[13:23:54] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11031099 (10Marostegui) Patches are done
[13:24:08] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11031100 (10Marostegui)
[13:32:16] <wikibugs>	 (03CR) 10Novem Linguae: [C:03+1] "Code looks good. Overall, this is very similar to enwiki's setup, with one difference being that they are creating a scrutineer group for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[13:34:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79828 and previous config saved to /var/cache/conftool/dbconfig/20250724-133439-marostegui.json
[13:34:45] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:34:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79829 and previous config saved to /var/cache/conftool/dbconfig/20250724-133456-fceratto.json
[13:35:11] <wikibugs>	 (PS1) Dreamy Jazz: Make TSP extensions have warning logs in logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1172323
[13:35:17] <Dreamy_Jazz>	 jouncebot: nowandnext
[13:35:17] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1300)
[13:35:17] <jouncebot>	 In 0 hour(s) and 54 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1430)
[13:36:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172323 (owner: Dreamy Jazz)
[13:37:05] <wikibugs>	 (Merged) jenkins-bot: Make TSP extensions have warning logs in logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1172323 (owner: Dreamy Jazz)
[13:37:25] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]]
[13:39:38] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:40:57] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[13:41:57] <wikibugs>	 SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11031179 (dcaro) with this, I added a few of the config values back: ` root@cloudcephmon2004-dev:~# ce...
[13:43:24] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw
[13:46:17] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172323|Make TSP extensions have warning logs in logstash]] (duration: 08m 51s)
[13:49:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P79830 and previous config saved to /var/cache/conftool/dbconfig/20250724-134946-marostegui.json
[13:50:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399728)', diff saved to https://phabricator.wikimedia.org/P79831 and previous config saved to /var/cache/conftool/dbconfig/20250724-135004-fceratto.json
[13:50:09] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[13:50:20] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance
[13:50:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79832 and previous config saved to /var/cache/conftool/dbconfig/20250724-135027-fceratto.json
[13:51:54] <wikibugs>	 (CR) Xcollazo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: Xcollazo)
[14:04:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P79833 and previous config saved to /var/cache/conftool/dbconfig/20250724-140454-marostegui.json
[14:05:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79834 and previous config saved to /var/cache/conftool/dbconfig/20250724-140519-fceratto.json
[14:05:25] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[14:09:54] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:10:41] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:11:04] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11031334 (Jhancock.wm) @Marostegui today or tomorrow is fine.
[14:12:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031341 (Jhancock.wm) @jhathaway i can take care of this connection today. Can you remind me what the hostname of this server is?
[14:15:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031357 (Jhancock.wm) @cirrussearch2091 john took a look at it for me and it looks like he was able to get it...
[14:16:29] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031359 (Scott_French)
[14:18:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031384 (jhathaway) >>! In T400211#11031335, @Jhancock.wm wrote: > @jhathaway i can take care of this connection today. Can you remind me what the host...
[14:20:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79836 and previous config saved to /var/cache/conftool/dbconfig/20250724-142001-marostegui.json
[14:20:07] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[14:20:18] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[14:20:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79837 and previous config saved to /var/cache/conftool/dbconfig/20250724-142024-marostegui.json
[14:20:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P79838 and previous config saved to /var/cache/conftool/dbconfig/20250724-142033-fceratto.json
[14:22:41] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031413 (bking) @Jhancock.wm sure, can y'all try installing Bullseye on it? I only switched it to UEFI because...
[14:22:51] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031414 (Jhancock.wm) @bking i rebooted the idrac on this server. It's about all i can do for the moment. is there a time we can depool this server...
[14:24:38] <logmsgbot>	 !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cirrussearch2079.codfw.wmnet with reason: T396718
[14:24:43] <stashbot>	 T396718: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718
[14:24:54] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye
[14:25:01] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031424 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host c...
[14:25:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031430 (bking) @Jhancock.wm Sorry we did not respond to this one sooner. The host is downtimed and depooled, feel free to reboot whenever.
[14:27:07] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031448 (Scott_French) No further manual clean-up actions are currently planned, though there will be various spot fixes as teams update their build...
[14:27:19] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11031449 (Scott_French) Open→Resolved
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1430)
[14:33:01] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:35:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P79839 and previous config saved to /var/cache/conftool/dbconfig/20250724-143541-fceratto.json
[14:41:10] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage
[14:44:09] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage
[14:44:57] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:30] <wikibugs>	 (PS1) Federico Ceratto: zarcillo: Enable egress for Alertmanager [deployment-charts] - https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810)
[14:48:30] <wikibugs>	 (CR) Federico Ceratto: "A simple change to enable egress to Alertmanager" [deployment-charts] - https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[14:50:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T399728)', diff saved to https://phabricator.wikimedia.org/P79840 and previous config saved to /var/cache/conftool/dbconfig/20250724-145048-fceratto.json
[14:50:54] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[14:51:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance
[14:51:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79841 and previous config saved to /var/cache/conftool/dbconfig/20250724-145112-fceratto.json
[14:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170616 (owner: Krinkle)
[14:55:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170614 (owner: Krinkle)
[14:56:28] <wikibugs>	 (Merged) jenkins-bot: build: Fix failing `phpcs` in CI on commits updating interwiki.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1170616 (owner: Krinkle)
[14:56:36] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1170614 (owner: Krinkle)
[14:56:39] <wikibugs>	 (CR) Clément Goubert: [C:+1] "LGTM, leaving to o11y if we're ok with reaching out to AM from k8s (may need some firewall rule on AM side?)" [deployment-charts] - https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[14:56:58] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]]
[14:59:08] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:00:05] <jouncebot>	 dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1500).
[15:00:24] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[15:01:16] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2091.codfw.wmnet with OS bullseye
[15:01:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cirru...
[15:05:47] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170616|build: Fix failing `phpcs` in CI on commits updating interwiki.php]], [[gerrit:1170614|Update interwiki cache]] (duration: 08m 48s)
[15:06:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79842 and previous config saved to /var/cache/conftool/dbconfig/20250724-150605-fceratto.json
[15:06:11] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[15:06:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79843 and previous config saved to /var/cache/conftool/dbconfig/20250724-150622-marostegui.json
[15:06:28] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:06:40] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:06:42] <Krinkle>	 quiddity: Interwiki for mediawiki.org in Parsoid/VisualEditor now defaults to mw: again like before :)
[15:09:05] <wikibugs>	 (CR) JHathaway: [C:+1] redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: Elukey)
[15:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:45] <wikibugs>	 (PS2) Btullis: dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene)
[15:10:41] <wikibugs>	 (CR) Btullis: [C:+1] dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene)
[15:11:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11031731 (Jhancock.wm) @jhathaway that one took. Still working with Dell on the other server. I'll let you know...
[15:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:06] <swfrench-wmf>	 !log reprepro include php-xhprof_2.3.10-1+wmf11u1 in component/php81 - T398245
[15:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:12] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[15:13:52] <swfrench-wmf>	 !log reprepro include php-xhprof_2.3.10-1+wmf11u1 tideways_5.0.4-16+wmf11u2 in component/php83 - T398245
[15:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:21:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P79844 and previous config saved to /var/cache/conftool/dbconfig/20250724-152113-fceratto.json
[15:21:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P79845 and previous config saved to /var/cache/conftool/dbconfig/20250724-152129-marostegui.json
[15:28:49] <wikibugs>	 (CR) JHathaway: sre.hosts.provision: add custom settings for Supermicro (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[15:31:20] <wikibugs>	 (CR) Elukey: sre.hosts.provision: add custom settings for Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[15:33:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031851 (Jhancock.wm) I'm not sure for the wording and want to clarify. Are you ordering the items needed? Or do we need to start a procurement task?
[15:34:50] <wikibugs>	 (PS6) Ilias Sarantopoulos: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[15:36:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P79846 and previous config saved to /var/cache/conftool/dbconfig/20250724-153620-fceratto.json
[15:36:29] <wikibugs>	 (CR) Ilias Sarantopoulos: "I've updated the patch to use the old schema for now so that we can go ahead and deploy this and we can figure out the new annotations aft" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[15:36:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P79847 and previous config saved to /var/cache/conftool/dbconfig/20250724-153637-marostegui.json
[15:37:17] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2079']
[15:37:35] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cirrussearch2079']
[15:39:13] <wikibugs>	 (CR) Ilias Sarantopoulos: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[15:44:37] <wikibugs>	 (PS1) Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039)
[15:45:45] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[15:46:04] <wikibugs>	 (CR) CI reject: [V:-1] haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[15:46:45] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:47:27] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[15:48:09] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:48:22] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:51:06] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2079']
[15:51:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T399728)', diff saved to https://phabricator.wikimedia.org/P79848 and previous config saved to /var/cache/conftool/dbconfig/20250724-155128-fceratto.json
[15:51:33] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[15:51:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance
[15:51:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399249)', diff saved to https://phabricator.wikimedia.org/P79849 and previous config saved to /var/cache/conftool/dbconfig/20250724-155144-marostegui.json
[15:51:51] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:51:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79850 and previous config saved to /var/cache/conftool/dbconfig/20250724-155151-fceratto.json
[15:51:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[15:52:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79851 and previous config saved to /var/cache/conftool/dbconfig/20250724-155206-marostegui.json
[15:53:28] <wikibugs>	 (PS1) Btullis: Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162)
[15:55:59] <wikibugs>	 (CR) Btullis: "This will require a corresponding change in deployment-charts to do the following:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) (owner: Btullis)
[15:57:21] <wikibugs>	 SRE-OnFire, cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11031974 (fnegri)
[15:58:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2079']
[15:58:49] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11031980 (Jhancock.wm) a:Jhancock.wm updated the idrac firmware manually. then tested a run of firmware update script to see if it would connect....
[15:59:29] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031986 (jhathaway) >>! In T400211#11031851, @Jhancock.wm wrote: > I'm not sure for the wording and want to clarify. Are you ordering the items needed?...
[16:00:05] <jouncebot>	 jhathaway and moritzm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:48] <wikibugs>	 (CR) JHathaway: sre.hosts.provision: add custom settings for Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[16:00:52] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11031990 (Jhancock.wm) @RobH we can probably discuss this during our meeting this afternoon. Need to order the two items linked in the description.
[16:04:55] <wikibugs>	 (CR) Gmodena: [C:+1] Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) (owner: Btullis)
[16:06:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79852 and previous config saved to /var/cache/conftool/dbconfig/20250724-160643-fceratto.json
[16:06:46] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.190.0" for 2 host(s)
[16:06:49] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:07:44] <wikibugs>	 (PS1) Cathal Mooney: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - https://gerrit.wikimedia.org/r/1172352
[16:08:32] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.190.0" completed for 2 hosts
[16:21:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P79854 and previous config saved to /var/cache/conftool/dbconfig/20250724-162150-fceratto.json
[16:22:48] <jinxer-wm>	 FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[16:24:15] <wikibugs>	 (CR) Cathal Mooney: [C:+2] CHANGELOG: add changelogs for release v0.10.2 [software/homer] - https://gerrit.wikimedia.org/r/1172352 (owner: Cathal Mooney)
[16:27:09] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1024.eqiad.wmnet with OS bookworm
[16:32:01] <vgutierrez>	 should we worry about that Thumbor alert?
[16:32:06] <vgutierrez>	 Emperor: ^^?
[16:33:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
[16:34:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036 T399927', diff saved to https://phabricator.wikimedia.org/P79855 and previous config saved to /var/cache/conftool/dbconfig/20250724-163439-root.json
[16:34:46] <stashbot>	 T399927: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927
[16:35:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79856 and previous config saved to /var/cache/conftool/dbconfig/20250724-163553-marostegui.json
[16:35:58] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[16:36:21] <hnowlan>	 looks like there's a bad thumbor pod 
[16:36:22] <hnowlan>	 looking
[16:36:54] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.10.2 [software/homer] - https://gerrit.wikimedia.org/r/1172352 (owner: Cathal Mooney)
[16:36:57] <Emperor>	 vgutierrez: FTR, I'm not a thumbor expert (but it looks like one has appeared :) )
[16:36:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P79857 and previous config saved to /var/cache/conftool/dbconfig/20250724-163658-fceratto.json
[16:37:12] <vgutierrez>	 hnowlan: thx <3
[16:38:10] <quiddity>	 Krinkle: Thank you for the fix and for letting me know!
[16:39:24] <hnowlan>	 pod cordoned, error rate will hopefully dip a little - that alert is a little misleading as it's a high error rate for a single pod rather than the fleet aiui 
[16:39:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032190 (jhathaway) @KFrancis would you kindly confirm that @Novem_Linguae has signed the NDA?
[16:40:29] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032194 (Marostegui) >>! In T399927#11031334, @Jhancock.wm wrote: > @Marostegui today or tomorrow is fine.  es2036 is ready for you
[16:41:02] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032199 (Novem_Linguae) I haven't signed it yet. I'm happy to do so. Just need instructions on how to get that started.
[16:42:48] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[16:43:04] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1024.eqiad.wmnet with reason: host reimage
[16:44:58] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11032207 (10jhathaway)
[16:48:51] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1024.eqiad.wmnet with reason: host reimage
[16:49:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11032225 (10RobH) I'm a bit surprised the server doesn't have the db connection cable, but we can order that from SM store or Rich@SM via quotation.  Then...
[16:50:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11032232 (10jhathaway) @Miriam would you kindly approve @diego's request to be added to analytics-research-admins?
[16:51:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P79858 and previous config saved to /var/cache/conftool/dbconfig/20250724-165100-marostegui.json
[16:52:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T399728)', diff saved to https://phabricator.wikimedia.org/P79859 and previous config saved to /var/cache/conftool/dbconfig/20250724-165205-fceratto.json
[16:52:10] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:52:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance
[16:52:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79860 and previous config saved to /var/cache/conftool/dbconfig/20250724-165228-fceratto.json
[16:53:30] <hnowlan>	 !log delete thumbor pod where all instances displayed signs of T374350
[16:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:35] <stashbot>	 T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error - https://phabricator.wikimedia.org/T374350
[16:55:39] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369)
[16:56:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy)
[16:56:28] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83096MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[17:00:05] <jouncebot>	 bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1700)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1700)
[17:00:13] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11032270 (10jhathaway) @HCoplin-WMF happy to help grant you access. You may be able to request access through our new IDM tool, https://idm.wikimedia.org. Can you try logging in and requ...
[17:03:25] <bd808>	 looks like developer portal should release but also like the base image there needs some attention to fix the container builds. I'll start poking things
[17:06:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P79862 and previous config saved to /var/cache/conftool/dbconfig/20250724-170608-marostegui.json
[17:07:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79863 and previous config saved to /var/cache/conftool/dbconfig/20250724-170719-fceratto.json
[17:07:25] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[17:14:10] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[17:16:54] <wikibugs>	 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11032344 (10herron)  Trying today with an ad-hoc prometheus instance to compact the overlapping blocks before uploading  Generate the backfill blocks again  ` /tmp/backfill/tonecheck$ time p...
[17:16:55] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:17:10] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2036
[17:17:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2036
[17:21:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399249)', diff saved to https://phabricator.wikimedia.org/P79864 and previous config saved to /var/cache/conftool/dbconfig/20250724-172117-marostegui.json
[17:21:22] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[17:21:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[17:21:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79865 and previous config saved to /var/cache/conftool/dbconfig/20250724-172140-marostegui.json
[17:22:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P79866 and previous config saved to /var/cache/conftool/dbconfig/20250724-172227-fceratto.json
[17:23:36] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[17:26:54] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] opensearch: curator instance config to follow $enable_curator [puppet] - 10https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite)
[17:27:20] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032375 (10Marostegui) es2036 done ` [   21.582858] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: none [   2...
[17:27:33] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11032376 (10Marostegui)
[17:28:24] <wikibugs>	 (03PS1) 10Volans: insetup role report: update recipients [puppet] - 10https://gerrit.wikimedia.org/r/1172365
[17:34:21] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1024.eqiad.wmnet with OS bookworm
[17:37:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P79867 and previous config saved to /var/cache/conftool/dbconfig/20250724-173734-fceratto.json
[17:38:30] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1025.eqiad.wmnet with OS bookworm
[17:42:40] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11032404 (10KFrancis) Hi @Novem_Linguae, please send your legal name, postal address, and email to kfrancis@wikimedia.org and I will put the NDA together for you to sign.  Thanks!
[17:47:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79868 and previous config saved to /var/cache/conftool/dbconfig/20250724-174752-root.json
[17:47:55] <wikibugs>	 (03CR) 10Volans: "Updated reviewers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper)
[17:48:09] <wikibugs>	 (03CR) 10Volans: "Updated reviewers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol)
[17:50:39] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway)
[17:52:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T399728)', diff saved to https://phabricator.wikimedia.org/P79869 and previous config saved to /var/cache/conftool/dbconfig/20250724-175242-fceratto.json
[17:52:48] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[17:52:49] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1025.eqiad.wmnet with reason: host reimage
[17:52:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance
[17:57:03] <wikibugs>	 (03CR) 10Volans: "@fceratto@wikimedia.org" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans)
[17:57:33] <wikibugs>	 (03CR) 10Volans: "[update]" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans)
[17:58:57] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1025.eqiad.wmnet with reason: host reimage
[18:00:04] <jouncebot>	 dduvall and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T1800).
[18:02:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79870 and previous config saved to /var/cache/conftool/dbconfig/20250724-180258-root.json
[18:03:51] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372)
[18:03:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot)
[18:04:47] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172372 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot)
[18:06:19] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:07:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79871 and previous config saved to /var/cache/conftool/dbconfig/20250724-180758-marostegui.json
[18:08:04] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[18:12:40] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.11  refs T396372
[18:12:45] <stashbot>	 T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372
[18:13:05] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:15:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm
[18:16:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm
[18:18:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79872 and previous config saved to /var/cache/conftool/dbconfig/20250724-181803-root.json
[18:22:31] <dduvall>	 train is clear
[18:23:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P79873 and previous config saved to /var/cache/conftool/dbconfig/20250724-182306-marostegui.json
[18:29:44] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1025.eqiad.wmnet with OS bookworm
[18:32:13] <wikibugs>	 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402 (10RobH) 03NEW
[18:33:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79874 and previous config saved to /var/cache/conftool/dbconfig/20250724-183309-root.json
[18:33:16] <wikibugs>	 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11032563 (10RobH)
[18:38:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P79875 and previous config saved to /var/cache/conftool/dbconfig/20250724-183813-marostegui.json
[18:45:47] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1030.eqiad.wmnet with OS bookworm
[18:48:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79876 and previous config saved to /var/cache/conftool/dbconfig/20250724-184815-root.json
[18:51:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11032595 (10jhathaway) @ttaylor when you have a moment, please review and sign the L3 server access document.
[18:51:47] <logmsgbot>	 vriley@cumin1002 reimage (PID 2912686) is awaiting input
[18:52:03] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm
[18:52:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with...
[18:53:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399249)', diff saved to https://phabricator.wikimedia.org/P79877 and previous config saved to /var/cache/conftool/dbconfig/20250724-185320-marostegui.json
[18:53:27] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[18:53:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[18:53:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79878 and previous config saved to /var/cache/conftool/dbconfig/20250724-185343-marostegui.json
[18:54:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm
[18:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:54:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm
[19:03:12] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1030.eqiad.wmnet with reason: host reimage
[19:06:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405 (10SD0001) 03NEW
[19:08:39] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1030.eqiad.wmnet with reason: host reimage
[19:08:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11032662 (10SD0001)
[19:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11032667 (10ttaylor) @jhathaway done!
[19:19:57] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:27:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11032702 (10MusikAnimal) Confirming my sponsorship as WMF staff. SD0001 has done fantastic work in improving the performance of DB queries that can't be tested by other means su...
[19:29:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:33:36] <wikibugs>	 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032715 (10Scott_French)
[19:34:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:42:20] <logmsgbot>	 vriley@cumin1002 reimage (PID 2954117) is awaiting input
[19:42:36] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm
[19:42:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11032733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with...
[19:44:09] <wikibugs>	 (03PS1) 10BCornwall: Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389
[19:44:47] <wikibugs>	 (03PS2) 10BCornwall: Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367)
[19:45:02] <wikibugs>	 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032737 (10Scott_French)
[19:45:09] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1030.eqiad.wmnet with OS bookworm
[19:47:13] <wikibugs>	 (03PS1) 10Dzahn: add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199)
[19:47:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[19:47:47] <wikibugs>	 (03PS1) 10BCornwall: ncredir: Redirect wikipedialibrary.org [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367)
[19:48:39] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367)
[19:49:36] <wikibugs>	 06SRE: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11032749 (10Scott_French)
[19:51:13] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6416/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[19:55:29] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[19:56:10] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[19:57:40] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] Add wikipedialibrary.org to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1172389 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[19:57:46] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[19:58:59] <logmsgbot>	 !log brett@dns1004 START - running authdns-update
[19:59:19] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "#15 ERROR: "/go.sum" not found: not found" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[19:59:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79879 and previous config saved to /var/cache/conftool/dbconfig/20250724-195958-marostegui.json
[19:59:59] <logmsgbot>	 !log brett@dns1004 END - running authdns-update
[20:00:03] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2000).
[20:00:04] <jouncebot>	 Daimona: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <Daimona>	 o/
[20:00:25] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "upstream code I am looking at for this is line 25 - 32 in https://github.com/sourcebot-dev/sourcebot/blob/main/Dockerfile" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[20:00:53] <zabe>	 I can deploy
[20:01:22] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy)
[20:01:34] <Daimona>	 Thank you!
[20:02:14] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172362 (https://phabricator.wikimedia.org/T397369) (owner: 10Daimona Eaytoy)
[20:02:39] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]]
[20:02:44] <stashbot>	 T397369: Enable CampaignEvents Extension on wikimania  - https://phabricator.wikimedia.org/T397369
[20:03:16] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1031.eqiad.wmnet with OS bookworm
[20:04:36] <logmsgbot>	 !log zabe@deploy1003 zabe, daimona: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:04:46] <zabe>	 Daimona: can you test?
[20:06:41] <Daimona>	 Looks good aside from the usual caching issues like unavailable RL modules
[20:06:59] <logmsgbot>	 !log zabe@deploy1003 zabe, daimona: Continuing with sync
[20:12:26] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172362|Enable the CampaignEvents extension on wikimaniawiki (T397369)]] (duration: 09m 46s)
[20:12:31] <stashbot>	 T397369: Enable CampaignEvents Extension on wikimania  - https://phabricator.wikimedia.org/T397369
[20:12:35] <zabe>	 Daimona: should be live
[20:13:24] <Daimona>	 Can confirm, thank you!
[20:13:32] <zabe>	 yw
[20:13:33] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] ncredir: Redirect wikipedialibrary.org [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[20:14:08] <wikibugs>	 (03PS1) 10DLynch: Enable DiscussionTools thanks on existing "report incident" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172397
[20:15:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P79880 and previous config saved to /var/cache/conftool/dbconfig/20250724-201506-marostegui.json
[20:16:27] <wikibugs>	 (03CR) 10Dzahn: "wmflabs.org is deprecated and wmcloud.org should be used instead. let's see if they published this somewhere already or can easily change " [puppet] - 10https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[20:18:03] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1031.eqiad.wmnet with reason: host reimage
[20:23:21] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1031.eqiad.wmnet with reason: host reimage
[20:24:30] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] acme-chief: Add wikipedialibrary.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: 10BCornwall)
[20:26:39] <wikibugs>	 (03PS1) 10CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401
[20:26:39] <wikibugs>	 (03PS1) 10CDanis: varnish: include wmfuniq count in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1172402
[20:28:09] <wikibugs>	 (03PS2) 10DLynch: Enable DiscussionTools thanks on existing "report incident" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172397 (https://phabricator.wikimedia.org/T366095)
[20:28:28] <wikibugs>	 (03PS2) 10CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401
[20:28:28] <wikibugs>	 (03PS2) 10CDanis: varnish: include wmfuniq count in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1172402
[20:28:32] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172401 (owner: 10CDanis)
[20:28:36] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172402 (owner: 10CDanis)
[20:30:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P79881 and previous config saved to /var/cache/conftool/dbconfig/20250724-203013-marostegui.json
[20:45:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399249)', diff saved to https://phabricator.wikimedia.org/P79882 and previous config saved to /var/cache/conftool/dbconfig/20250724-204521-marostegui.json
[20:45:26] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[20:45:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1253.eqiad.wmnet with reason: Maintenance
[20:45:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79883 and previous config saved to /var/cache/conftool/dbconfig/20250724-204543-marostegui.json
[20:50:48] <jinxer-wm>	 FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[20:50:48] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1031.eqiad.wmnet with OS bookworm
[20:52:29] <wikibugs>	 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412 (10RobH) 03NEW
[20:53:17] <wikibugs>	 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11032928 (10RobH) Please note Jaime already merged the puppet changes needed for this host, so not assigning to them for that just leaving in the racking task column on #ops-eqiad for when...
[21:00:02] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1032.eqiad.wmnet with OS bookworm
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2100)
[21:12:52] <wikibugs>	 (CR) Zabe: [C:+2] Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) (owner: Zabe)
[21:13:41] <wikibugs>	 (Merged) jenkins-bot: Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) (owner: Zabe)
[21:14:08] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]]
[21:14:12] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[21:16:06] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:17:52] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[21:18:11] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1032.eqiad.wmnet with reason: host reimage
[21:23:19] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1032.eqiad.wmnet with reason: host reimage
[21:23:36] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169198|Set categorylinks to read new on most wikis (T397912)]] (duration: 09m 28s)
[21:23:40] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[21:25:07] <wikibugs>	 SRE, SRE-Access-Requests, SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11033037 (HCoplin-WMF) I was indeed able to log in and request it that way! Thank you :)   Apologies for the double request here, if that's y'all's preferred method. You might want to...
[21:29:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79884 and previous config saved to /var/cache/conftool/dbconfig/20250724-212916-marostegui.json
[21:29:21] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[21:44:08] <Dreamy_Jazz>	 jouncebot: nowandnext
[21:44:08] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250724T2100)
[21:44:08] <jouncebot>	 In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T0600)
[21:44:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P79885 and previous config saved to /var/cache/conftool/dbconfig/20250724-214423-marostegui.json
[21:51:21] <logmsgbot>	 !log dreamyjazz Deployed security patch for T399093
[21:59:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P79886 and previous config saved to /var/cache/conftool/dbconfig/20250724-215931-marostegui.json
[22:00:28] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.191.0" for 2 host(s)
[22:02:18] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.191.0" completed for 2 hosts
[22:04:51] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: (no justification provided)
[22:07:56] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:08:23] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: (no justification provided) (duration: 03m 36s)
[22:09:28] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11033187 (nisrael) Just confirming for visibility here, I was able to gain access!
[22:12:28] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:14:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399249)', diff saved to https://phabricator.wikimedia.org/P79887 and previous config saved to /var/cache/conftool/dbconfig/20250724-221439-marostegui.json
[22:14:44] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:14:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[22:15:15] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm
[22:15:26] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1032.eqiad.wmnet with OS bookworm
[22:15:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033199 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm
[22:21:53] <wikibugs>	 (PS1) Dreamy Jazz: Make SecurePoll channel log warnings [mediawiki-config] - https://gerrit.wikimedia.org/r/1172422
[22:22:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172422 (owner: Dreamy Jazz)
[22:22:35] <Dreamy_Jazz>	 Security deploy failed to apply properly. Using a scap backport to reset the patches back to before.
[22:23:04] <wikibugs>	 (Merged) jenkins-bot: Make SecurePoll channel log warnings [mediawiki-config] - https://gerrit.wikimedia.org/r/1172422 (owner: Dreamy Jazz)
[22:23:16] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]]
[22:25:16] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:26:05] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[22:27:44] <wikibugs>	 SRE, WMF-General-or-Unknown: Outdated link in message that is sent for forbidden connections - https://phabricator.wikimedia.org/T400421#11033218 (Pppery)
[22:31:19] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172422|Make SecurePoll channel log warnings]] (duration: 08m 03s)
[22:36:03] <wikibugs>	 (CR) Clare Ming: [C:+1] "just fyi, i think web team is running an a/a test on testwiki so they may be relying on these configs for data collection -- depending on " [mediawiki-config] - https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx)
[22:42:03] <logmsgbot>	 !log dreamyjazz Deployed security patch for T399093
[22:44:21] <wikibugs>	 (CR) ZhaoFJx: [C:+1] zhwiki: Allow local securepoll setup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[22:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:03:29] <logmsgbot>	 vriley@cumin1002 reimage (PID 3112741) is awaiting input
[23:04:02] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm
[23:04:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033279 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with...
[23:04:38] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:06:54] <wikibugs>	 SRE, serviceops, Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425 (Scott_French) NEW
[23:09:03] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:10:48] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[23:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:11:47] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm
[23:11:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm
[23:12:05] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:12:17] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:13:24] <wikibugs>	 SRE, serviceops, Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033296 (Scott_French)
[23:13:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:07] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:27:54] <logmsgbot>	 vriley@cumin1002 reimage (PID 3119536) is awaiting input
[23:34:27] <wikibugs>	 SRE, Traffic: Outdated link in message that is sent for forbidden connections - https://phabricator.wikimedia.org/T400421#11033311 (bd808) `lang=shell-session $ git grep https://meta.wikimedia.org/wiki/User-Agent_policy modules/profile/files/wmcs/services/maintain_dbusers/maintain_dbusers.py:    # https:...
[23:37:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1172433
[23:38:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1172433 (owner: TrainBranchBot)
[23:42:34] <wikibugs>	 (PS1) BryanDavis: wmcs: Update URL in comment in maintain_dbusers.py [puppet] - https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421)
[23:42:36] <wikibugs>	 (PS1) BryanDavis: varnish: Update User-Agent Policy url in error messages [puppet] - https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421)
[23:44:49] <wikibugs>	 (PS2) BryanDavis: varnish: Update User-Agent Policy url in error messages [puppet] - https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421)
[23:50:40] <wikibugs>	 SRE, Traffic, Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11033332 (bd808)
[23:51:50] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1172433 (owner: TrainBranchBot)
[23:52:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown