[00:00:25] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10065688 (Dwisehaupt) frdb2004 OS install complete. Will clone the DB across tomorrow.
[00:09:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062767 (owner: TrainBranchBot)
[00:15:58] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: Scott French)
[00:19:11] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1062754 (https://phabricator.wikimedia.org/T372507) (owner: Scott French)
[00:27:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:49:19] <wikibugs>	 (PS1) Eevans: aqs1022: provision new host for hardware refresh [puppet] - https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514)
[00:52:05] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10065708 (Eevans)
[00:52:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:07:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:17] <wikibugs>	 (CR) Scott French: [C:+1] "No worries! And thanks for your patience while it took me a little while to get back to this." [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[02:23:39] <logmsgbot>	 !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@02f37cf]: (no justification provided)
[02:24:22] <logmsgbot>	 !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@02f37cf]: (no justification provided) (duration: 00m 43s)
[02:39:25] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:25] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[03:24:24] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10065781 (Papaul) @Dwisehaupt you had the DNS information under Description and not under DNS Name see below ` IP Address Family  IPv4 VRF  Global Tenant  Fundraising Tech S...
[03:26:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix mgmt DNS for fd2004 - pt1979@cumin2002"
[03:26:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix mgmt DNS for fd2004 - pt1979@cumin2002"
[03:26:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:39:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10065796 (phaultfinder)
[03:53:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:27:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:33:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:49:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T372393
[04:49:25] <stashbot>	 T372393: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T372393
[04:49:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1189 with weight 0 T372393', diff saved to https://phabricator.wikimedia.org/P67323 and previous config saved to /var/cache/conftool/dbconfig/20240815-044929-root.json
[04:49:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T372393
[04:49:58] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1189 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1062381 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[04:55:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Stop MariaDB on db1238 T371342
[04:55:40] <stashbot>	 T371342: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342
[04:55:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Stop MariaDB on db1238 T371342
[04:56:30] <wikibugs>	 (PS1) Marostegui: db1238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1062780 (https://phabricator.wikimedia.org/T371342)
[04:59:23] <wikibugs>	 (CR) Marostegui: [C:+2] db1238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1062780 (https://phabricator.wikimedia.org/T371342) (owner: Marostegui)
[05:01:05] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10065828 (Marostegui) @VRiley-WMF you can proceed whenever you want. The host is ready.
[05:03:56] <marostegui>	 !log Starting s3 eqiad failover from db1223 to db1189 - T372393
[05:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:59] <stashbot>	 T372393: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T372393
[05:04:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T372393', diff saved to https://phabricator.wikimedia.org/P67324 and previous config saved to /var/cache/conftool/dbconfig/20240815-050410-root.json
[05:04:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1189 to s3 primary and set section read-write T372393', diff saved to https://phabricator.wikimedia.org/P67325 and previous config saved to /var/cache/conftool/dbconfig/20240815-050428-root.json
[05:04:59] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393)
[05:05:15] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[05:05:17] <wikibugs>	 (CR) Marostegui: [V:+2 C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[05:06:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223 T372393', diff saved to https://phabricator.wikimedia.org/P67326 and previous config saved to /var/cache/conftool/dbconfig/20240815-050613-root.json
[05:07:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67327 and previous config saved to /var/cache/conftool/dbconfig/20240815-050701-root.json
[05:08:59] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1062782 (https://phabricator.wikimedia.org/T372524)
[05:09:04] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524)
[05:17:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67328 and previous config saved to /var/cache/conftool/dbconfig/20240815-052206-root.json
[05:29:59] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2223 [puppet] - https://gerrit.wikimedia.org/r/1062788
[05:33:01] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2223 [puppet] - https://gerrit.wikimedia.org/r/1062788 (owner: Marostegui)
[05:37:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67329 and previous config saved to /var/cache/conftool/dbconfig/20240815-053712-root.json
[05:52:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67330 and previous config saved to /var/cache/conftool/dbconfig/20240815-055218-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0600).
[06:00:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67331 and previous config saved to /var/cache/conftool/dbconfig/20240815-060723-root.json
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:22:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67332 and previous config saved to /var/cache/conftool/dbconfig/20240815-062229-root.json
[06:31:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:31:44] <jinxer-wm>	 Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ...
[06:31:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:37:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67333 and previous config saved to /var/cache/conftool/dbconfig/20240815-063734-root.json
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:09:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[07:10:03] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10065976 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye
[07:17:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10066002 (JMeybohm) >>! In T371422#10064622, @VRiley-WMF wrote: > I have allocated these drives added these SSDs to specified servers. Please test it out and let us k...
[07:31:01] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts
[07:31:04] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[07:31:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts
[07:47:10] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[07:47:18] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error...
[07:49:11] <wikibugs>	 (CR) MVernon: [C:+1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: Eevans)
[07:50:24] <wikibugs>	 (PS1) David Caro: ceph: add alert when we get no data from the cluster [alerts] - https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528)
[08:00:05] <jouncebot>	 jeena and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0800). nyaa~
[08:00:24] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye
[08:00:36] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066038 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with error...
[08:01:39] <wikibugs>	 (PS3) AOkoth: vrts: build & install packages [cookbooks] - https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078)
[08:04:29] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[08:04:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066052 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye
[08:04:50] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1062721 (owner: Ilias Sarantopoulos)
[08:16:41] <wikibugs>	 (CR) Stevemunene: [C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:16:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:16:44] <jinxer-wm>	 Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ...
[08:16:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:16:49] <wikibugs>	 (PS5) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760)
[08:18:56] <wikibugs>	 (CR) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:45:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066125 (Ifeatu_Nnaobi_WMDE) >>! In T371796#10043749, @Dzahn wrote: > @ifeatu_nnaobi_wmde Could you please send an email to [[ https://meta.wikimedia.org/wiki/U...
[08:55:13] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye
[08:55:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with error...
[08:59:02] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3641/co" [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:05:04] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] "I have some doubts if there are any untracked packets at all but let's try this and see if Gerrit throttling behaves differently" [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:08:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066183 (eoghan) a:Ifeatu_Nnaobi_WMDE
[09:09:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066184 (eoghan) a:Ifeatu_Nnaobi_WMDE→eoghan
[09:11:36] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] "After merging this I'm added to the DENYLIST immediately when visiting my Gerrit dashboard." [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:17:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[09:24:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[09:25:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T367856)', diff saved to https://phabricator.wikimedia.org/P67334 and previous config saved to /var/cache/conftool/dbconfig/20240815-092502-marostegui.json
[09:25:06] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:27:20] <wikibugs>	 (CR) Stevemunene: [C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[09:27:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[09:27:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[09:35:43] <wikibugs>	 (CR) Stevemunene: [V:+2 C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[09:39:27] <wikibugs>	 (PS1) JMeybohm: preseed: Switch to reuse receipts for kafka-main20(06,09,10) [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423)
[09:43:11] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066264 (VRiley-WMF) Worked with Dell on this. Originally, they wanted to update firmware before anything else. However, I provided them with TSR reports. They informed me it'...
[09:43:14] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[09:44:13] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066265 (Marostegui) Thanks! I am going to get it online and I will close the task once it is repooled today. If we see something strange we'll reopen  Thank you!
[09:45:16] <wikibugs>	 (PS1) Marostegui: Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1062968
[09:48:17] <wikibugs>	 (CR) JMeybohm: [C:+2] preseed: Switch to reuse receipts for kafka-main20(06,09,10) [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[09:50:58] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1062968 (owner: Marostegui)
[09:55:18] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[09:55:34] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066344 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bu...
[09:56:53] <wikibugs>	 (PS1) Hnowlan: php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1000)
[10:03:12] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066386 (VRiley-WMF) Open→Resolved Awesome, I'll go ahead and mark this as resolved for now.
[10:04:07] <wikibugs>	 (CR) David Caro: [C:+2] wmcs: enable mypy on all our modules [puppet] - https://gerrit.wikimedia.org/r/1060800 (owner: David Caro)
[10:04:10] <wikibugs>	 (CR) David Caro: [C:+2] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - https://gerrit.wikimedia.org/r/1060794 (owner: David Caro)
[10:06:02] <wikibugs>	 (CR) Kamila Součková: [C:+1] php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521) (owner: Hnowlan)
[10:10:31] <wikibugs>	 (CR) Hnowlan: [V:+2 C:+2] php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521) (owner: Hnowlan)
[10:11:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67335 and previous config saved to /var/cache/conftool/dbconfig/20240815-101139-root.json
[10:11:54] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066428 (Marostegui) Host being automatically repooled.
[10:15:25] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[10:18:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[10:19:56] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:21:37] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:22:51] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536)
[10:23:29] <wikibugs>	 (CR) Marostegui: [C:+2] control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536) (owner: Marostegui)
[10:23:57] <wikibugs>	 (Merged) jenkins-bot: control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536) (owner: Marostegui)
[10:26:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67336 and previous config saved to /var/cache/conftool/dbconfig/20240815-102645-root.json
[10:27:10] <marostegui>	 !log Install 10.6.19 on pc1014 db1125 pc2014 T372536
[10:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:12] <stashbot>	 T372536: Compile and package MariaDB 10.6.19 - https://phabricator.wikimedia.org/T372536
[10:27:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2014.codfw.wmnet with reason: Upgrade to 10.6.19
[10:27:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet with reason: Upgrade to 10.6.19
[10:28:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc1014.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:28:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc1014.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:28:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1125.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:29:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1125.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:36:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS bullseye
[10:37:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye completed: - kafka-...
[10:41:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67337 and previous config saved to /var/cache/conftool/dbconfig/20240815-104150-root.json
[10:49:10] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for SaraiSan WMF - https://phabricator.wikimedia.org/T372290#10066506 (10eoghan) 05Open→03Resolved Confirmed working!
[10:51:42] <wikibugs>	 (03CR) 10Hnowlan: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[10:56:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67338 and previous config saved to /var/cache/conftool/dbconfig/20240815-105656-root.json
[11:00:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:04:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:12:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67339 and previous config saved to /var/cache/conftool/dbconfig/20240815-111201-root.json
[11:12:30] <wikibugs>	 (03PS1) 10Hnowlan: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048)
[11:24:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:27:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67340 and previous config saved to /var/cache/conftool/dbconfig/20240815-112707-root.json
[11:27:27] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:30:03] <wikibugs>	 (03PS3) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[11:30:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[11:42:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67341 and previous config saved to /var/cache/conftool/dbconfig/20240815-114213-root.json
[11:44:12] <wikibugs>	 (03PS4) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[11:44:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[11:48:14] <wikibugs>	 (03PS5) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[11:48:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1200)
[12:08:11] <wikibugs>	 (03PS6) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:09:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:09:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[12:10:00] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[12:10:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066693 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye
[12:10:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye
[12:12:36] <wikibugs>	 (03PS7) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:14:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:14:48] <wikibugs>	 (03CR) 10Klausman: [C:03+2] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman)
[12:15:44] <wikibugs>	 (03Abandoned) 10Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman)
[12:16:13] <wikibugs>	 (03Restored) 10Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman)
[12:16:53] <wikibugs>	 (03CR) 10Klausman: [V:03+2 C:03+2] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman)
[12:16:59] <wikibugs>	 (03PS1) 10Seddon: Save the request before starting the automatic vanish job [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006)
[12:17:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon)
[12:18:29] <wikibugs>	 (03PS8) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:20:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:20:07] <wikibugs>	 (03Merged) 10jenkins-bot: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman)
[12:20:32] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062999
[12:23:29] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:25:11] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:26:24] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:26:55] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:28:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage
[12:29:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[12:30:41] <wikibugs>	 (03PS9) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:32:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage
[12:34:56] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[12:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:47:24] <wikibugs>	 (03PS10) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:47:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:49:47] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS bullseye
[12:49:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye completed: - kafka-...
[12:50:16] <wikibugs>	 (03PS11) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:51:26] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3650/console" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:52:30] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2009.codfw.wmnet with OS bullseye
[12:52:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye completed: - kafka-...
[12:52:52] <wikibugs>	 (03PS12) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:54:22] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[12:55:55] <wikibugs>	 (03PS13) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[12:57:29] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3652/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1300).
[13:00:05] <jouncebot>	 seddon: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:01:48] <Lucas_WMDE>	 I can probably deploy in a few minutes
[13:04:01] <Lucas_WMDE>	 (if Seddon is around)
[13:04:06] <Seddon>	 I am!
[13:04:25] <Lucas_WMDE>	 ok, then I can deploy now
[13:05:32] <Seddon>	 Cool, I'll get myself set up
[13:05:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon)
[13:05:51] <Lucas_WMDE>	 CentralAuth doesn’t have any weird special train / deployment branch stuff, right?
[13:05:54] <Lucas_WMDE>	 that’s CentralNotice I’m thinking of
[13:08:01] <Seddon>	 I'm not aware of anything funny like with CN
[13:08:08] <Lucas_WMDE>	 yeah, it looks like nothing special had to be done for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1054574 either
[13:08:14] <Lucas_WMDE>	 I’ll just hope for the best then ^^
[13:08:19] <wikibugs>	 (03PS1) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[13:15:23] <wikibugs>	 (03Merged) 10jenkins-bot: Save the request before starting the automatic vanish job [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon)
[13:15:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]]
[13:16:19] <stashbot>	 T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006
[13:17:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:11] <wikibugs>	 (03PS2) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[13:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:23] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:23:55] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:24:08] <Lucas_WMDE>	 that k8s image build sure is taking a while
[13:24:09] * Lucas_WMDE looks
[13:24:51] <Lucas_WMDE>	 okay, it’s making some progress again
[13:25:09] <Lucas_WMDE>	 looks like it had been pushing an image to the registry from 13:18 until about 13:24, which feels longer than it should be
[13:25:58] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:25:59] <Lucas_WMDE>	 anyway, that finished, now docker_pull_k8s is running
[13:26:12] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:26:15] <Lucas_WMDE>	 (also quite slowly… maybe the image diff is bigger than usual, though idk why it would be)
[13:26:56] <Seddon>	 Long backports seem to be the norm with mw-on-k8s
[13:27:38] <cdanis>	 Lucas_WMDE: that definitely feels longer than it should take
[13:27:41] <wikibugs>	 (03PS3) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[13:27:55] <Lucas_WMDE>	 Seddon: it got a lot better in the meantime tbh
[13:28:07] <Lucas_WMDE>	 claime: docker_pull_k8s is currently at 15% after 4 minutes fwiw
[13:28:50] <Lucas_WMDE>	 the image that took long to push was docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-08-15-131615-publish btw
[13:29:11] <Lucas_WMDE>	 (whereas pushing docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-08-14-180507-webserver only took two seconds)
[13:29:37] <Lucas_WMDE>	 (oh wait, that one’s a timestamp from yesterday so it probably had nothing to do anyway)
[13:30:14] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[13:30:29] <ihurbain>	 (i don't know what c.laime specifically is doing, but today is a holiday in france, just for the record)
[13:30:42] <wikibugs>	 (03PS4) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[13:31:08] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:31:43] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560 (10MatthewVernon) 03NEW
[13:31:44] <Lucas_WMDE>	 ihurbain: thanks, I totally forgot about that – it’s not a holiday in godless berlin 😔
[13:31:50] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[13:31:53] <Lucas_WMDE>	 (but it is in bavaria and a few other states. federalism!)
[13:32:08] <Lucas_WMDE>	 oh wait, I meant to ping cdanis anyway 🤦
[13:32:25] <ihurbain>	 Lucas_WMDE: it's not in kanton zürich either, but it is in kanton zug next to it (yay federalism too :P )
[13:32:32] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:32:34] <ihurbain>	 haha :D
[13:32:36] <Lucas_WMDE>	 :D
[13:32:49] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:33:30] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:33:45] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:34:01] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:34:13] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:34:29] <Lucas_WMDE>	 docker_pull_k8s finished, yay
[13:34:37] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:34:55] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:35:04] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10066857 (10MatthewVernon) p:05Triage→03High
[13:38:07] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[13:40:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 seddon, lucaswerkmeister-wmde: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:40:40] <Lucas_WMDE>	 at last!
[13:40:42] <Lucas_WMDE>	 Seddon: please test ^^
[13:40:46] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:40:47] <Lucas_WMDE>	 (if possible)
[13:40:48] <stashbot>	 T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006
[13:41:39] <Seddon>	 @Lucas_WMDE mwdebug?
[13:41:42] <Seddon>	 Or production?
[13:41:47] <Lucas_WMDE>	 mwdebug so far
[13:41:51] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:42:08] <Seddon>	 Lucas_WMDE: eqiad 1002?
[13:42:21] <Lucas_WMDE>	 mwdebug-k8s normally
[13:42:34] <Lucas_WMDE>	 but it should be synced to all of them
[13:43:23] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:43:29] <Seddon>	 We are all good
[13:44:39] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:44:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 seddon, lucaswerkmeister-wmde: Continuing with sync
[13:44:46] <Lucas_WMDE>	 ok, thanks!
[13:45:33] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:46:12] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:47:59] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:49:51] <wikibugs>	 (03PS7) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366)
[13:50:26] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:50:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] (duration: 34m 44s)
[13:51:05] <sukhe>	 !log sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1053929 T369366"'
[13:51:31] <sukhe>	 hmm
[13:51:46] <wikibugs>	 (03PS5) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[13:51:57] <stashbot>	 T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006
[13:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:11] <Lucas_WMDE>	 looks like stashbot was lagging a bit
[13:52:26] <stashbot>	 T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[13:52:30] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:34] <Lucas_WMDE>	 Seddon: should be deployed everywhere now
[13:54:02] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3657/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[13:54:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10066924 (10Jhancock.wm) 05Open→03Resolved
[13:54:16] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4004.wikimedia.org,service=recdns [reason: admin_state migration test]
[13:54:22] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4004.wikimedia.org [reason: admin_state migration test]
[13:54:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10066925 (10Jhancock.wm) 05Open→03Resolved
[13:54:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[13:54:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:59:08] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified]
[13:59:18] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified]
[13:59:45] <wikibugs>	 (03PS1) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006
[13:59:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:59:57] <wikibugs>	 (03PS6) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[14:00:20] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site magru for service: text-addrs|text-next [reason: testing on dns4004, no task ID specified]
[14:00:25] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site magru for service: text-addrs|text-next [reason: testing on dns4004, no task ID specified]
[14:02:16] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[14:04:45] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: testing on dns4004, no task ID specified]
[14:04:47] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: testing on dns4004, no task ID specified]
[14:04:50] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified]
[14:04:51] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified]
[14:06:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye
[14:06:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye
[14:11:47] <wikibugs>	 (03PS1) 10Ebernhardson: Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007
[14:14:35] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366)
[14:14:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[14:15:44] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[14:15:57] <wikibugs>	 (03PS2) 10Ssingh: P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366)
[14:17:01] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[14:17:55] <logmsgbot>	 !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2007.codfw.wmnet with OS bullseye
[14:18:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye executed with error...
[14:18:43] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[14:18:44] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007 (owner: 10Ebernhardson)
[14:19:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007 (owner: 10Ebernhardson)
[14:21:05] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:21:11] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:24:29] <wikibugs>	 (03PS2) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006
[14:25:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye
[14:25:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067026 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye
[14:33:42] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified]
[14:33:46] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified]
[14:35:55] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[14:36:18] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified]
[14:36:21] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified]
[14:39:25] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:08] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:42:30] <wikibugs>	 (03CR) 10Eevans: aqs1022: provision new host for hardware refresh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans)
[14:43:19] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:44:43] <wikibugs>	 (03CR) 10Eevans: aqs1022: provision new host for hardware refresh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans)
[14:44:55] <wikibugs>	 (03PS2) 10Eevans: aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514)
[14:46:27] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:47:36] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:48:21] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:49:28] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:53:39] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:57:58] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:59:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:00:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10067165 (10Dwisehaupt) @Papaul Oh, thanks for finding that. I didn't spot the wrong field when updating the typo. DNS looks good now. Thanks!
[15:00:04] <jouncebot>	 jeena and jnuche: Your horoscope predicts another Train log triage deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1500).
[15:00:23] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: testing on dns4004, no task ID specified]
[15:00:24] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: testing on dns4004, no task ID specified]
[15:00:38] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: testing on dns4004, no task ID specified]
[15:00:39] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: testing on dns4004, no task ID specified]
[15:00:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:58] <wikibugs>	 (03CR) 10MVernon: [C:03+1] aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans)
[15:03:49] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:03:49] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:04:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:09:21] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:09:22] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:09:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: no reason specified, no task ID specified]
[15:09:26] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: no reason specified, no task ID specified]
[15:14:05] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth::update set confd_admin_state true for all DNS boxes [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366)
[15:15:44] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[15:17:14] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::update set confd_admin_state true for all DNS boxes [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[15:18:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[15:19:19] <wikibugs>	 (03CR) 10David Caro: [C:03+2] ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[15:20:33] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[15:20:55] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: moving ahead with admin_state migration]
[15:21:01] <sukhe>	 !log running authdns-update
[15:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:32] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:21:32] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified]
[15:27:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye
[15:29:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye
[15:30:16] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:30:25] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:31:06] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:31:17] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:31:39] <wikibugs>	 (03PS3) 10Hnowlan: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309)
[15:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:34] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "Generally LGTM, seems like the least bad option." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm)
[15:37:26] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm)
[15:41:34] <wikibugs>	 (03PS2) 10Ssingh: Remove admin_state handling from ops/dns [dns] - 10https://gerrit.wikimedia.org/r/1059140 (owner: 10BBlack)
[15:42:44] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans)
[15:43:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Remove admin_state handling from ops/dns [dns] - 10https://gerrit.wikimedia.org/r/1059140 (owner: 10BBlack)
[15:43:25] <wikibugs>	 (03PS1) 10EoghanGaffney: apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015
[15:43:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015 (owner: 10EoghanGaffney)
[15:43:52] <sukhe>	 !log running authdns-update
[15:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:48] <wikibugs>	 (03PS2) 10EoghanGaffney: apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015
[15:44:55] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10067280 (10Eevans) a:05Eevans→03None
[15:45:17] <sukhe>	 !log running authdns-update again
[15:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:29] <wikibugs>	 (03CR) 10Dzahn: "thanks for merging and testing this" [puppet] - 10https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[15:48:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[15:49:15] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "Maybe s/but-ptrace/except-ptrace/g ? Not a strong opinion though." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm)
[15:51:15] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:52:48] <sukhe>	 !log sudo cumin -b1 -s60 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1053929 T369366'": T369366
[15:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:19] <stashbot>	 T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[15:53:53] <SandraEbele_>	 !log reran druid_load_geoeditors_monthly, cassandra_load_editors_by_country_monthly, and druid_load_edit_hourly airflow dags with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot.
[15:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:09] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2007.codfw.wmnet with OS bullseye
[15:55:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067297 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye executed with error...
[15:59:37] <wikibugs>	 (03PS3) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006
[16:00:04] <jouncebot>	 jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1600)
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:02] <wikibugs>	 (03PS1) 10David Caro: ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528)
[16:03:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[16:03:32] <wikibugs>	 (03CR) 10David Caro: [C:03+2] ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[16:04:43] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro)
[16:05:26] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:06:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10067319 (10Dzahn) Ah! That's good, I think in this case we just need a confirmation from Katie and ask her to add you to the so called "NDA and MOU"-spreadsheet w...
[16:13:30] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "For the short term given our p99 for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:15:21] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:15:53] <wikibugs>	 (03CR) 10Hnowlan: "Oops, hit submit too early. Given our p75 of ~20s this is definitely a concern for the future. Our p95 is generally about 45s but can spik" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[16:36:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10067421 (10Andrew) @cmooney can we get cloudcephosd1036 set up now that the switch work is done?
[16:44:20] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332)
[16:44:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[16:45:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[16:51:18] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:51:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage
[16:51:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:52:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[16:53:54] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[16:54:55] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage
[17:03:10] <wikibugs>	 (03CR) 10Scott French: "Hmmm .. it looks like this was never applied?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi)
[17:07:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: testing live change, T369366]
[17:07:35] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: testing live change, T369366]
[17:07:40] <stashbot>	 T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[17:13:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS bullseye
[17:13:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye completed: - kafka-...
[17:18:14] <wikibugs>	 06SRE, 06Traffic: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#10067510 (10ssingh) 05Open→03Resolved a:03ssingh https://wikitech.wikimedia.org/wiki/DNS#Change_GeoDNS_/_Depool_a_Site
[17:19:10] <bd808>	 jouncebot: now
[17:19:10] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1700)
[17:19:10] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1700)
[17:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:37] * bd808 has nothing to deploy in his reserved window today
[17:22:17] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: testing done, T369366]
[17:22:28] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: testing done, T369366]
[17:22:33] <stashbot>	 T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[17:33:23] <wikibugs>	 (03PS1) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028
[17:35:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10067540 (10KFrancis) Hi all, I'm confirming we have an NDA on file.  Thanks!
[17:39:33] <wikibugs>	 (03PS2) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028
[17:41:08] <logmsgbot>	 !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox
[17:44:35] <logmsgbot>	 !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update mgmt dns for civi2002 frpig2002 - dwisehaupt@cumin1002"
[17:44:39] <logmsgbot>	 !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update mgmt dns for civi2002 frpig2002 - dwisehaupt@cumin1002"
[17:44:39] <logmsgbot>	 !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:44:58] <wikibugs>	 (03PS1) 10Scott French: mediawiki: consistently apply stats-global values via symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265)
[17:44:59] <wikibugs>	 (03PS1) 10Scott French: mw-debug: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063032 (https://phabricator.wikimedia.org/T368366)
[17:45:01] <wikibugs>	 (03PS1) 10Scott French: mw-api-int: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366)
[17:45:03] <wikibugs>	 (03PS1) 10Scott French: mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366)
[17:52:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 (owner: 10Ssingh)
[17:54:16] <wikibugs>	 (03PS3) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028
[17:59:39] <wikibugs>	 (03PS1) 10Peter Fischer: Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362)
[18:00:05] <jouncebot>	 jeena and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1800).
[18:00:25] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade  on VRTS host vrts1001.eqiad.wmnet
[18:02:10] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0)  on VRTS host vrts1001.eqiad.wmnet
[18:03:34] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963)
[18:03:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot)
[18:04:15] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot)
[18:15:00] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.18  refs T366963
[18:15:20] <stashbot>	 T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[18:26:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 (owner: 10Ssingh)
[18:54:47] <tgr|away>	 !log running global rename cleanup script per T372006#10055573
[18:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:07] <stashbot>	 T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006
[19:26:28] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524) (owner: 10Gerrit maintenance bot)
[19:28:53] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) (owner: 10Dwisehaupt)
[19:38:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10067817 (10Jhancock.wm)
[19:42:16] <ebernhardson>	 jouncebot next
[19:42:16] <jouncebot>	 In 0 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T2000)
[19:49:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <LogStash server or localhost port 9200> for <ecarg> - https://phabricator.wikimedia.org/T372445#10067847 (10ecarg) @JMeybohm I receive the following after replacing with that url:   ` curl -XGET 'https://logs-api.svc.eqiad.wmnet/_msearch?pretty&size=10000...
[19:54:40] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499)
[19:54:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński)
[19:56:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10067859 (10Jhancock.wm)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T2000). Please do the needful.
[20:00:05] <jouncebot>	 ebernhardson and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:24] <MatmaRex>	 hi
[20:00:59] <MatmaRex>	 i just made my patch a minute ago, i haven't found anyone to review it yet, but it's hopefully a straightforward revert
[20:01:34] <ebernhardson>	 \o
[20:01:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10067873 (10Jhancock.wm)
[20:06:40] <MatmaRex>	 hmm, do we have any deployers around?
[20:07:59] <ebernhardson>	 MatmaRex: aren't you a deployer?
[20:08:07] <ebernhardson>	 if not i guess i can
[20:08:11] <MatmaRex>	 nope
[20:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński)
[20:09:54] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer)
[20:11:11] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[20:19:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10067961 (10Jhancock.wm)
[20:20:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński)
[20:20:16] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]]
[20:20:40] <stashbot>	 T372499: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads (August 2024) - https://phabricator.wikimedia.org/T372499
[20:23:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10067966 (10Jhancock.wm) @MoritzMuehlenhoff when you have a moment, can you do this step for me please? thanks! Update the operations/puppet repo
[20:23:33] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson, matmarex: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10067968 (10Jhancock.wm) @aborrero when you have a moment, can you do this step for me please? thanks!  Update the operations/puppet repo
[20:24:25] <MatmaRex>	 testing
[20:24:50] <MatmaRex>	 and it looks good
[20:25:13] <MatmaRex>	 (tested on https://en.wikivoyage.org/wiki/Template_talk:Related#Related_topics_&_empty_space_in_articles_bottom - with the fix applied, the page has "Reply" buttons again)
[20:25:40] <ebernhardson>	 MatmaRex: excellent, going forward
[20:25:46] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson, matmarex: Continuing with sync
[20:30:23] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]] (duration: 10m 06s)
[20:30:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[20:31:06] <stashbot>	 T372499: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads (August 2024) - https://phabricator.wikimedia.org/T372499
[20:31:53] <wikibugs>	 (03CR) 10Scott French: [C:03+1] rpc: add script for running jobs from stdin rather than http (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[20:32:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[20:32:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:51] <MatmaRex>	 thank you very much for deploying ebernhardson
[20:33:59] <ebernhardson>	 certainly
[20:35:14] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332)
[20:36:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[20:36:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[20:37:09] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson)
[20:37:19] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]]
[20:37:51] <stashbot>	 T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332
[20:39:17] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:41:15] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Continuing with sync
[20:45:45] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]] (duration: 08m 25s)
[20:46:10] <stashbot>	 T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332
[20:49:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[20:51:35] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer)
[20:52:37] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer)
[20:54:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2035 to codfw - jhancock@cumin2002"
[20:55:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2035 to codfw - jhancock@cumin2002"
[20:55:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:01:13] <ebernhardson>	 !log backport window complete
[21:01:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[21:41:51] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs graph-split: data xfer needs python3-snappy [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077)
[21:42:24] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper)
[21:43:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[21:43:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[21:43:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:44:36] <wikibugs>	 (03PS1) 10BCornwall: varnish: Remove carriers netmap [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200)
[21:45:57] <wikibugs>	 (03PS8) 10Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077)
[21:46:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED
[21:46:56] <logmsgbot>	 !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:47:04] <logmsgbot>	 !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:48:28] <wikibugs>	 (03CR) 10Bking: [C:03+1] "The PCC failure is for Puppet 5 only; the change works with Puppet 7." [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper)
[21:48:37] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs graph-split: data xfer needs python3-snappy [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper)
[21:50:44] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] thanos: temp disable compact [puppet] - 10https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi)
[21:51:16] <wikibugs>	 (03PS1) 10Andrea Denisse: alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418)
[21:51:16] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1062444/3666/" [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[21:53:15] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards
[21:53:15] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[21:53:31] <stashbot>	 T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[21:53:55] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards
[21:54:03] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[21:54:27] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[21:55:06] <wikibugs>	 (03PS1) 10Andrea Denisse: alert: Add the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418)
[21:55:06] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1063063/3667/" [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[21:55:32] <wikibugs>	 (03CR) 10BCornwall: "FWIW, vtc tests passed:" [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall)
[21:59:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper)
[22:01:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED
[22:02:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev']
[22:09:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev']
[22:10:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[22:10:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10068225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[22:13:34] <wikibugs>	 (03PS1) 10Andrea Denisse: alert: Ensure alert1002 is the active alert host [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418)
[22:13:34] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1063075/3668/" [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[22:21:04] <wikibugs>	 (03PS1) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418)
[22:42:57] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[22:43:15] <stashbot>	 T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[23:10:20] <xSavitar>	 !log T372449 mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Philip Federici' 'FilippoFederici' --ignorestatus
[23:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:39] <stashbot>	 T372449: Unblock stuck global rename of FilippoFederici - https://phabricator.wikimedia.org/T372449
[23:30:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[23:30:29] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10068319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed...
[23:38:44] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1063080
[23:38:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1063080 (owner: 10TrainBranchBot)