[00:01:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050483 (owner: TrainBranchBot)
[00:05:27] <wikibugs>	 (PS8) Jdlrobson: Enable action edit/submit and remaining special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366524)
[00:06:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage
[00:06:14] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:06:15] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[00:08:23] <wikibugs>	 (CR) Ssingh: "codfw and drmrs are not single-backend yet and have just one NVMe drive so we cannot unify all the configs just yet, sadly." [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: BCornwall)
[00:08:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage
[00:10:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:29:16] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:29:43] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[00:33:00] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:33:02] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[00:33:14] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:33:33] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[00:36:15] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:36:41] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[00:40:45] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS bullseye
[00:40:54] <wikibugs>	 ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933201 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye compl...
[00:45:07] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet
[00:46:05] <wikibugs>	 ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933202 (BCornwall)
[00:51:01] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[00:51:34] <wikibugs>	 (PS4) BCornwall: hiera: Unify all trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174)
[00:52:51] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[01:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:25:33] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:29:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:37:16] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[01:37:35] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[01:39:30] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[01:39:32] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[01:40:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 31.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:44:46] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[01:44:47] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[02:00:14] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:00:34] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[02:00:55] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:00:56] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[02:01:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:04:28] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:05:33] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:06:45] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:14:33] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[02:16:29] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:19:45] <jinxer-wm>	 FIRING: Device rebooted: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:20:59] <wikibugs>	 ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368697 (phaultfinder) NEW
[02:23:12] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:23:17] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[02:24:45] <jinxer-wm>	 RESOLVED: Device rebooted: Device ps1-a8-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:27:54] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:30:22] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[02:30:41] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[03:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:54:37] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:54:37] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:55:07] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:55:09] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:19] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9933337 (Papaul)
[05:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:31:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:15] <wikibugs>	 (PS3) Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0600)
[06:01:20] <wikibugs>	 (PS2) Ayounsi: Cookbooks: fix Netbox 4 breaking changes [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275)
[06:02:37] <wikibugs>	 (CR) Ayounsi: "I only tested `sre.network.debug` but seeing how small the changes are, after proper review we can fix any remaining bugs once deployed to" [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:15] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:30:47] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:52:49] <wikibugs>	 (CR) Vgutierrez: [C:-1] "all domains looking good from here but `wikimedia.ro`:" [dns] - https://gerrit.wikimedia.org/r/1050484 (owner: Ncmonitor)
[06:58:47] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 16.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0700)
[07:09:50] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9933422 (SGupta-WMF) Thank you @Scott_French and @mforns . I re-ran the pipe...
[07:13:54] <wikibugs>	 SRE, collaboration-services, LDAP-Access-Requests, Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9933424 (SLyngshede-WMF) @Dzahn It's already on my todo :-)
[07:19:33] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295) (owner: Dzahn)
[07:54:22] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4
[07:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:39] <icinga-wm>	 RECOVERY - Disk space on backup2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops
[08:12:19] <wikibugs>	 (CR) Jelto: [C:+2] Revert "aptrepo: revert gitlab-ce version to 16.11" [puppet] - https://gerrit.wikimedia.org/r/1050314 (https://phabricator.wikimedia.org/T365675) (owner: Jelto)
[08:12:26] <wikibugs>	 (CR) Elukey: Tox: add python 3.12 support (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/1050262 (owner: Ayounsi)
[08:15:47] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:22:05] <wikibugs>	 (CR) Elukey: "Left some comments mostly to better understand the changes, but looks good! I assume that this patch will be merged only when we upgrade n" [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:29:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance
[08:29:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance
[08:29:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65543 and previous config saved to /var/cache/conftool/dbconfig/20240628-082946-marostegui.json
[08:29:52] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:30:55] <wikibugs>	 SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9933522 (elukey) Current status: * We are following up with Supermicro to customize the default root password for...
[08:33:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9933530 (cmooney)
[08:34:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[08:37:05] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE))
[08:41:00] <wikibugs>	 (PS1) Btullis: Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259)
[08:42:47] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:46:51] <wikibugs>	 (PS1) Slavina Stefanova: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567
[08:55:31] <wikibugs>	 (CR) Ayounsi: "Overall lgtm, some inline comments." [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[09:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:11:02] <wikibugs>	 SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9933562 (elukey)
[09:13:22] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9933563 (cmooney) Open→Resolved Thanks all for the help with this one!
[09:14:07] <wikibugs>	 (CR) Brouberol: [C:+1] "The tag as a timestamp is really nice! Thanks @bking@wikimedia.org" [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:14:33] <wikibugs>	 (CR) Btullis: [C:+2] Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:17:32] <wikibugs>	 (Merged) jenkins-bot: Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:19:21] <wikibugs>	 (PS5) Superpes15: [pswiki] Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851)
[09:22:53] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:25:01] <wikibugs>	 (CR) JMeybohm: [C:+1] deployment_server: Add a mwscript-k8s cleanup script [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[09:28:19] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[09:31:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:33:18] <wikibugs>	 (PS1) Elukey: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366)
[09:33:22] <wikibugs>	 (PS1) Elukey: admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366)
[09:33:26] <wikibugs>	 (PS1) Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366)
[09:33:30] <wikibugs>	 (PS1) Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366)
[09:37:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:37:47] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:49:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[09:51:21] <wikibugs>	 (PS1) Hnowlan: thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301)
[09:51:31] <wikibugs>	 (CR) Ayounsi: Homer: fix Netbox 4 breaking changes (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:52:13] <wikibugs>	 (PS2) Ayounsi: Tox: add python 3.12 support [software/homer] - https://gerrit.wikimedia.org/r/1050262
[09:52:13] <wikibugs>	 (PS4) Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275)
[09:52:26] <wikibugs>	 (CR) Ayounsi: Tox: add python 3.12 support (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/1050262 (owner: Ayounsi)
[09:57:47] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-serve2007.codfw.wmnet
[09:57:48] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve2007.codfw.wmnet
[10:05:03] <wikibugs>	 (PS1) Btullis: Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259)
[10:05:36] <wikibugs>	 (CR) Brouberol: [C:+1] Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[10:08:09] <wikibugs>	 (CR) Btullis: [C:+2] Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[10:08:32] <wikibugs>	 (CR) Clément Goubert: [C:+1] thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan)
[10:09:15] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:16] <wikibugs>	 (CR) Hnowlan: [C:+2] thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan)
[10:09:25] <wikibugs>	 (CR) Ayounsi: "First (quick) pass." [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[10:11:18] <wikibugs>	 (Merged) jenkins-bot: Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[10:12:21] <wikibugs>	 (Merged) jenkins-bot: thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan)
[10:12:24] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:16:54] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:17:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:17:17] <wikibugs>	 (CR) Klausman: [C:+1] "LGTM, but I defer to Hugh on the final yea/nay." [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:17:44] <wikibugs>	 (CR) Klausman: [C:+1] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:17:58] <wikibugs>	 (CR) Klausman: [C:+1] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:18:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:21:31] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Add deploy1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1050345 (https://phabricator.wikimedia.org/T364416) (owner: Clément Goubert)
[10:22:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:22:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:22:34] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:30:17] <wikibugs>	 (PS1) Clément Goubert: Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074)
[10:30:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:33:45] <wikibugs>	 (CR) Hnowlan: [C:+1] Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[10:34:20] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Update gnmic config to allow processing of all interface stats [puppet] - https://gerrit.wikimedia.org/r/1049242 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney)
[10:34:34] <wikibugs>	 (CR) Hnowlan: [C:+1] "lgtm, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:34:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9933846 (MoritzMuehlenhoff) Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point.
[10:35:28] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:37:33] <wikibugs>	 (CR) Clément Goubert: [C:+2] Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[10:40:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1412 to wikikube-worker1027
[10:40:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:42:38] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1412 to wikikube-worker1027 - cgoubert@cumin1002"
[10:43:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1412 to wikikube-worker1027 - cgoubert@cumin1002"
[10:43:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:43:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1027
[10:43:55] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:44:02] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:44:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1027
[10:45:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1412 to wikikube-worker1027
[10:45:26] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.eqiad.wmnet on all recursors
[10:45:29] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.eqiad.wmnet on all recursors
[10:45:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:45:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[10:46:01] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1413 to wikikube-worker1028
[10:46:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:48:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1413 to wikikube-worker1028 - cgoubert@cumin1002"
[10:49:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1413 to wikikube-worker1028 - cgoubert@cumin1002"
[10:49:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:49:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
[10:50:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
[10:50:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1413 to wikikube-worker1028
[10:50:53] <logmsgbot>	 jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:51:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.eqiad.wmnet on all recursors
[10:51:16] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.eqiad.wmnet on all recursors
[10:51:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[10:51:39] <Dreamy_Jazz>	 !log Running `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200`
[10:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1417 to wikikube-worker1029
[10:51:58] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:54:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1417 to wikikube-worker1029 - cgoubert@cumin1002"
[10:54:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[10:56:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1417 to wikikube-worker1029 - cgoubert@cumin1002"
[10:56:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:56:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1029
[10:56:55] <Dreamy_Jazz>	 !log Stopped running script at `cawiki`
[10:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1029
[10:58:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1417 to wikikube-worker1029
[10:58:30] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[10:58:37] <Dreamy_Jazz>	 !log Running `foreachwikiindblist group1-wikipedia.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200`
[10:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:07] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1029.eqiad.wmnet on all recursors
[10:59:10] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1029.eqiad.wmnet on all recursors
[10:59:57] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0700)
[11:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T1100).
[11:00:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1418 to wikikube-worker1030
[11:00:32] <wikibugs>	 (PS1) Btullis: Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259)
[11:00:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:00:57] <wikibugs>	 (CR) Brouberol: [C:+1] Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[11:01:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[11:02:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1418 to wikikube-worker1030 - cgoubert@cumin1002"
[11:04:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1418 to wikikube-worker1030 - cgoubert@cumin1002"
[11:04:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:04:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1030
[11:04:14] <wikibugs>	 (CR) Btullis: [C:+2] Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[11:05:41] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3498 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:06:05] <jelto>	 ^ should resolve soon
[11:06:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[11:06:43] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 108890 bytes in 1.031 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:06:52] <wikibugs>	 (CR) Ladsgroup: [C:+1] bashrc: adds alias for ripgrep [puppet] - https://gerrit.wikimedia.org/r/1050398 (owner: Arnaudb)
[11:06:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:07:01] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[11:07:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[11:08:05] <wikibugs>	 (Merged) jenkins-bot: Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[11:08:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1030
[11:08:46] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1418 to wikikube-worker1030
[11:09:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1450 to wikikube-worker1031
[11:09:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:10:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1030.eqiad.wmnet on all recursors
[11:11:16] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1030.eqiad.wmnet on all recursors
[11:11:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1030.eqiad.wmnet with OS bullseye
[11:11:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:11:57] <Dreamy_Jazz>	 !log `foreachwikiindblist group1-wikipedia.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` finished running
[11:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:56] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided)
[11:13:16] <Dreamy_Jazz>	 !log Running `foreachwikiindblist medium.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` for T366781. `medium.dblist` does not include `loginwiki` or `metawiki` (which are to be done later).
[11:13:21] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 25s)
[11:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:21] <stashbot>	 T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781
[11:14:15] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1450 to wikikube-worker1031 - cgoubert@cumin1002"
[11:15:29] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[11:15:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[11:16:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:16:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1450 to wikikube-worker1031 - cgoubert@cumin1002"
[11:16:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:16:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1031
[11:17:05] <wikibugs>	 (PS2) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1046678
[11:17:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1031
[11:17:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1450 to wikikube-worker1031
[11:18:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1031.eqiad.wmnet on all recursors
[11:18:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1031.eqiad.wmnet on all recursors
[11:18:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1031.eqiad.wmnet with OS bullseye
[11:23:57] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[11:24:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[11:25:50] <wikibugs>	 (CR) Jforrester: [C:+2] Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE)
[11:25:59] <wikibugs>	 (CR) Jforrester: [C:+1] "Bah." [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE)
[11:26:16] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:26:29] <wikibugs>	 (CR) Ladsgroup: [C:+1] "I'm planning to deploy this on Monday. Sorry for missing this." [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE)
[11:28:05] <wikibugs>	 (CR) Hnowlan: [C:+1] mcrouter: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[11:28:12] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/991452 (owner: PipelineBot)
[11:28:16] <wikibugs>	 (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992662 (owner: PipelineBot)
[11:28:20] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/995362 (owner: PipelineBot)
[11:28:23] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1006875 (owner: PipelineBot)
[11:28:27] <wikibugs>	 (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1016384 (owner: PipelineBot)
[11:28:31] <wikibugs>	 (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1019766 (owner: PipelineBot)
[11:28:34] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1022033 (owner: PipelineBot)
[11:28:38] <wikibugs>	 (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1030557 (owner: PipelineBot)
[11:28:41] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1031594 (owner: PipelineBot)
[11:28:45] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1047465 (owner: PipelineBot)
[11:28:55] <wikibugs>	 (PS2) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1049506
[11:29:13] <wikibugs>	 (PS1) Btullis: Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495)
[11:29:14] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided)
[11:29:36] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[11:29:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[11:29:59] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 44s)
[11:30:25] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[11:30:59] <wikibugs>	 (CR) Phuedx: [C:+1] Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis)
[11:31:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[11:31:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage
[11:35:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage
[11:37:47] <wikibugs>	 (CR) Btullis: [C:+2] Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis)
[11:38:23] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1030.eqiad.wmnet with OS bullseye
[11:38:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1030.eqiad.wmnet with OS bullseye
[11:38:38] <wikibugs>	 (Merged) jenkins-bot: Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis)
[11:41:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590
[11:41:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590 (owner: TrainBranchBot)
[11:44:04] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[11:44:07] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[11:44:11] <wikibugs>	 (CR) Hnowlan: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[11:44:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[11:44:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage
[11:45:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[11:45:29] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[11:45:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[11:46:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:47:33] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:47:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage
[11:49:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:50:43] <Dreamy_Jazz>	 !log Finished run on `medium.dblist`
[11:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage
[11:54:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1031.eqiad.wmnet with OS bullseye
[11:55:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage
[11:57:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:59:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:18] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[12:05:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[12:06:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1028.eqiad.wmnet with OS bullseye
[12:08:18] <icinga-wm>	 PROBLEM - Host wikikube-worker1028 is DOWN: PING CRITICAL - Packet loss = 100%
[12:10:46] <icinga-wm>	 RECOVERY - Host wikikube-worker1028 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:13:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1030.eqiad.wmnet with OS bullseye
[12:14:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65544 and previous config saved to /var/cache/conftool/dbconfig/20240628-121404-marostegui.json
[12:14:11] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[12:15:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[12:17:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for an-conf1005,6 - jclark@cumin1002"
[12:18:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for an-conf1005,6 - jclark@cumin1002"
[12:18:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:21:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1004
[12:21:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590 (owner: TrainBranchBot)
[12:23:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1004
[12:23:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1006
[12:24:37] <wikibugs>	 (PS1) Hashar: Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029)
[12:24:48] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1006
[12:29:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65545 and previous config saved to /var/cache/conftool/dbconfig/20240628-122911-marostegui.json
[12:32:28] <wikibugs>	 (PS1) TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134)
[12:34:20] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9934200 (cmooney) @fgiunchedi I was perhaps a little cheeky and merged this, but it was clear the volume of new metrics was well withi...
[12:35:25] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] "Done" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[12:35:26] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[12:35:58] <wikibugs>	 (PS1) Alexandros Kosiaris: deploy1003: Switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1050597 (https://phabricator.wikimedia.org/T364416)
[12:37:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[12:37:36] <wikibugs>	 (PS1) Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322)
[12:39:08] <wikibugs>	 (CR) Hashar: [C:+2] Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029) (owner: Hashar)
[12:39:38] <wikibugs>	 (Merged) jenkins-bot: Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029) (owner: Hashar)
[12:40:19] <wikibugs>	 (CR) Elukey: Homer: fix Netbox 4 breaking changes (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[12:43:56] <wikibugs>	 (CR) Arnaudb: [C:+2] bashrc: adds alias for ripgrep [puppet] - https://gerrit.wikimedia.org/r/1050398 (owner: Arnaudb)
[12:44:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1004
[12:44:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:44:15] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1004
[12:44:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65546 and previous config saved to /var/cache/conftool/dbconfig/20240628-124419-marostegui.json
[12:44:40] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[12:45:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:45:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[12:48:05] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[12:48:18] <icinga-wm>	 PROBLEM - Host wikikube-worker1028 is DOWN: PING CRITICAL - Packet loss = 100%
[12:48:18] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[12:50:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage
[12:50:46] <icinga-wm>	 RECOVERY - Host wikikube-worker1028 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[12:51:54] <wikibugs>	 ops-eqiad, SRE, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9934251 (Jclark-ctr)
[12:51:58] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9934253 (mforns) Yay! Thanks @SGupta-WMF
[12:52:04] <wikibugs>	 (PS1) Elukey: TESTING ONLY - profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[12:53:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage
[12:53:15] <hashar>	 I am going to upgrade Gerrit to apply some patches for regressions we have discovered over the week
[12:53:22] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] deploy1003: Switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1050597 (https://phabricator.wikimedia.org/T364416) (owner: Alexandros Kosiaris)
[12:53:23] <hashar>	 that will be a short downtime
[12:53:33] <wikibugs>	 (PS2) Elukey: TESTING ONLY - profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[12:54:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:52] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[12:55:19] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@0db053e]: Upgrade Gerrit 3.10.0-32-gf77960412e to 3.10.0-71-gf6e9431fff - T367029 T341291
[12:55:26] <stashbot>	 T367029: "Press c to comment" is placed incorrectly when using Firefox 126 and 128 on macOS - https://phabricator.wikimedia.org/T367029
[12:55:26] <stashbot>	 T341291: Install gerrit image-diff plugin - https://phabricator.wikimedia.org/T341291
[12:55:28] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@0db053e]: Upgrade Gerrit 3.10.0-32-gf77960412e to 3.10.0-71-gf6e9431fff - T367029 T341291 (duration: 00m 09s)
[12:55:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:57:53] <wikibugs>	 (03PS3) 10Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[12:59:02] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3108/" [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[12:59:02] <hashar>	 I am stopping Gerrit NOW
[12:59:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65547 and previous config saved to /var/cache/conftool/dbconfig/20240628-125926-marostegui.json
[12:59:35] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:01:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bookworm
[13:03:08] <wikibugs>	 (03PS25) 10DCausse: wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[13:03:08] <wikibugs>	 (03PS5) 10DCausse: wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950)
[13:03:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[13:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:05:53] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: mgmt ip issue
[13:06:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: mgmt ip issue
[13:09:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "One nit, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:10:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:11:10] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1029.eqiad.wmnet with OS bullseye
[13:11:14] <wikibugs>	 (03CR) 10Elukey: [V:03+1] profile::puppetserver::git: add an option to exclude servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:11:31] <wikibugs>	 (03PS4) 10Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[13:11:43] <wikibugs>	 (03CR) 10Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:12:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:12:38] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3109/" [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:13:26] <wikibugs>	 (03PS26) 10DCausse: wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[13:13:27] <wikibugs>	 (03PS6) 10DCausse: wdqs: enable throttling only for requests coming from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950)
[13:14:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:15:12] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:16:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs: enable throttling only for requests coming from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[13:17:45] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: add gitlab to recipient discards [puppet] - 10https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[13:18:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:18:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:18:54] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:18:57] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:19:55] <wikibugs>	 (03PS1) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:20:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:21:02] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:21:18] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3110/" [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:22:12] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good! We should definitely run PCC on one each of text/upload in all sites, just to be extra sure before rolling this out. Or maybe " [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[13:24:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:29] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker102[7-8] - cmooney@cumin1002"
[13:25:28] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker102[7-8] - cmooney@cumin1002"
[13:25:28] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:26:53] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[13:27:22] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:27:25] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:27:31] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:27:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:28:22] <hnowlan>	 !log running `decommission` on 5 codfw api appservers
[13:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:15] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002"
[13:29:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[13:31:22] <wikibugs>	 (03PS2) 10Dzahn: admin: add jsn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295)
[13:32:06] <wikibugs>	 (03PS2) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:32:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host krb1002.eqiad.wmnet with OS bookworm
[13:33:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host krb1002.eqiad.wmnet with OS bookworm
[13:33:14] <wikibugs>	 (03PS1) 10Btullis: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050608
[13:33:21] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: add jsn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295) (owner: 10Dzahn)
[13:33:23] <wikibugs>	 (03PS2) 10Btullis: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495)
[13:34:54] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495) (owner: 10Btullis)
[13:34:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 379.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:35:22] <wikibugs>	 (03PS3) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:35:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495) (owner: 10Btullis)
[13:35:59] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "one suggestion, otherwise looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:36:54] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: move 5 codfw api appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074)
[13:37:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9934379 (10Dzahn) 05In progress→03Resolved @jsn.sherman You have now been added to the group as requested.  Feel free...
[13:37:50] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[13:38:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:38:17] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[13:38:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:38:31] <wikibugs>	 (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471)
[13:38:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:39:33] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] kubernetes: move 5 codfw api appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[13:39:39] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[13:39:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368697#9934399 (10Dzahn)
[13:39:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kubernetes: move 5 codfw api appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[13:40:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368697#9934403 (10Dzahn) →14Duplicate dup:03T368648
[13:40:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9934401 (10Dzahn)
[13:40:53] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[13:41:10] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002"
[13:41:12] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bookworm
[13:41:36] <wikibugs>	 (03PS4) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:41:46] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:42:03] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:42:04] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[13:42:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye
[13:42:36] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[13:42:37] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[13:42:45] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:42:56] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[13:43:18] <wikibugs>	 (03PS2) 10Slavina Stefanova: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567
[13:43:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage
[13:43:58] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:44:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9934431 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[13:44:42] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9934434 (10Dzahn) @Ottomata Sharvaniharan requested access to superset and the "wmf" group here but I noticed she already has that.  Based on the description "Currently dashboard is not loading...
[13:44:54] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9934439 (10Dzahn) a:05Dzahn→03None
[13:45:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9934442 (10Dzahn) 05In progress→03Open
[13:45:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on krb1002.eqiad.wmnet with reason: host reimage
[13:45:55] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2298 to wikikube-worker2025
[13:46:08] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9934454 (10Dzahn)
[13:46:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage
[13:46:12] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[13:46:20] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2300 to wikikube-worker2026
[13:46:38] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2300 to wikikube-worker2026
[13:47:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (owner: 10Slavina Stefanova)
[13:49:02] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2298 to wikikube-worker2025 - hnowlan@cumin1002"
[13:49:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb1002.eqiad.wmnet with reason: host reimage
[13:50:13] <wikibugs>	 (03PS1) 10Hashar: gerrit: enable "new" image diff UI [puppet] - 10https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291)
[13:53:19] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:54:00] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2298 to wikikube-worker2025 - hnowlan@cumin1002"
[13:54:00] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:54:00] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2025
[13:54:24] <wikibugs>	 (03PS5) 10Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[13:54:24] <wikibugs>	 (03PS5) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:54:25] <wikibugs>	 (03PS1) 10Btullis: ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259)
[13:54:30] <wikibugs>	 (03CR) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[13:54:34] <wikibugs>	 (03CR) 10Elukey: profile::puppetserver::git: add an option to exclude servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:56:54] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:56:57] <wikibugs>	 (03CR) 10DCausse: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[13:57:23] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] ceph-csi: revert fsGroup change and disable metrics container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:57:27] <wikibugs>	 (03CR) 10Hashar: "I think it is fine to enable at anytime though I am pretty sure the Gerrit daemon requires to be restarted to apply the change." [puppet] - 10https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar)
[13:58:15] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway)
[13:58:48] <wikibugs>	 (03CR) 10Btullis: [C:03+2] ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:59:22] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2025
[13:59:30] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2298 to wikikube-worker2025
[14:00:37] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2306 to wikikube-worker2027
[14:00:43] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:01:31] <jhathaway>	 !log ingressing email on mx-in1001, initial test 1hr
[14:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:47] <wikibugs>	 (03Merged) 10jenkins-bot: ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:03:14] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2306 to wikikube-worker2027 - hnowlan@cumin1002"
[14:04:15] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:04:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:05:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[14:05:38] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mx-in1001.wikimedia.org with reason: email testing
[14:06:02] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in1001.wikimedia.org with reason: email testing
[14:06:58] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2306 to wikikube-worker2027 - hnowlan@cumin1002"
[14:06:58] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:06:58] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2027
[14:06:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:07:12] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2027
[14:07:21] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2306 to wikikube-worker2027
[14:07:59] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2308 to wikikube-worker2028
[14:08:04] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:09:01] <wikibugs>	 (03PS1) 10Kosta Harlan: IPReputation: Enable extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050619 (https://phabricator.wikimedia.org/T360067)
[14:10:38] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bullseye
[14:12:02] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2308 to wikikube-worker2028 - hnowlan@cumin1002"
[14:14:15] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:36] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:21:59] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2308 to wikikube-worker2028 - hnowlan@cumin1002"
[14:21:59] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:22:00] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2028
[14:22:10] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2028
[14:22:19] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2308 to wikikube-worker2028
[14:22:37] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2330 to wikikube-worker2029
[14:22:40] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:22:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host krb1002.eqiad.wmnet with OS bookworm
[14:22:43] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:22:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host krb1002.eqiad.wmnet with OS bookworm completed: - krb1002 (**WARN**)...
[14:23:20] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[14:23:48] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[14:23:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934586 (10Jclark-ctr)
[14:24:12] <wikibugs>	 (03PS1) 10Ssingh: durum: remove redundant anycast_peers [puppet] - 10https://gerrit.wikimedia.org/r/1050620
[14:24:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934589 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[14:24:47] <wikibugs>	 (03PS2) 10Ssingh: durum: remove redundant override for profile::systemd::timesyncd::ntp_servers [puppet] - 10https://gerrit.wikimedia.org/r/1050620
[14:24:56] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[14:25:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2330 to wikikube-worker2029 - hnowlan@cumin1002"
[14:25:20] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3114/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050620 (owner: 10Ssingh)
[14:25:24] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[14:26:15] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] durum: remove redundant override for profile::systemd::timesyncd::ntp_servers [puppet] - 10https://gerrit.wikimedia.org/r/1050620 (owner: 10Ssingh)
[14:26:48] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2330 to wikikube-worker2029 - hnowlan@cumin1002"
[14:26:48] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:26:48] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2029
[14:27:13] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2029
[14:27:22] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2330 to wikikube-worker2029
[14:27:27] <sukhe>	 !log sudo cumin "O:durum" "run-puppet-agent"
[14:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:15] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2025.codfw.wmnet with OS bullseye
[14:28:50] <wikibugs>	 (03PS1) 10Btullis: cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259)
[14:29:11] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2027.codfw.wmnet with OS bullseye
[14:30:00] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2028.codfw.wmnet with OS bullseye
[14:30:28] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2029.codfw.wmnet with OS bullseye
[14:30:46] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368732 (10phaultfinder) 03NEW
[14:33:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:33:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: Final mw-on-k8s cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[14:34:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: Cleanup mw-on-k8s scripts [puppet] - 10https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[14:34:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:34:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver::lua_script: Implement ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[14:35:33] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:43] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:39:15] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:34] <wikibugs>	 (03PS5) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[14:41:13] <wikibugs>	 (03PS1) 10Btullis: Configure fsgroup for the cephcsi nodeplugin pod to be 900 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:41:46] <wikibugs>	 (03PS59) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[14:41:50] <wikibugs>	 06SRE, 10Cloud-Services, 06DBA, 07Tracking-Neverending: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#9934676 (10sguebo_WMF) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikime...
[14:41:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Configure fsgroup for the cephcsi nodeplugin pod to be 900 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:42:01] <wikibugs>	 (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[14:42:38] <wikibugs>	 (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[14:42:42] <wikibugs>	 (03CR) 10Scott French: [C:03+2] services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[14:43:34] <wikibugs>	 (03Merged) 10jenkins-bot: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[14:44:19] <wikibugs>	 (03PS1) 10Ssingh: hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - 10https://gerrit.wikimedia.org/r/1050626
[14:45:05] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage
[14:45:21] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050626 (owner: 10Ssingh)
[14:45:24] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage
[14:46:32] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:46:51] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:47:55] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks!  Updated based on some of the feedback.  I'll delve into the get_link_data function to try and address the concerns about using th" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[14:47:59] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage
[14:48:15] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3116/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[14:49:42] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: deploy1003: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416)
[14:50:22] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:51:03] <wikibugs>	 (03PS2) 10Btullis: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:51:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:33] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3118/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[14:51:36] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9934720 (10elukey) The safest bet is to use `python3-confluent-kafka` in my opinion, it is pa...
[14:51:38] <wikibugs>	 (03PS3) 10Btullis: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:51:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:52:54] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:53:37] <wikibugs>	 (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:54:18] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[14:55:11] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:55:15] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:55:21] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1659 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:56:16] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage
[14:56:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:19] <wikibugs>	 (03Merged) 10jenkins-bot: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[14:58:31] <wikibugs>	 (03PS1) 10JHathaway: Revert "temporarily add mx-in1001 as an MX server" [dns] - 10https://gerrit.wikimedia.org/r/1050630
[14:58:48] <wikibugs>	 (03PS3) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850)
[14:58:55] <wikibugs>	 (03CR) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[14:58:57] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Revert "temporarily add mx-in1001 as an MX server" [dns] - 10https://gerrit.wikimedia.org/r/1050630 (owner: 10JHathaway)
[14:59:15] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:18] <wikibugs>	 (03PS1) 10AikoChou: ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102)
[15:00:19] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9934769 (10spatton) Hey teammates and cc @Dzahn as you recommended us posting here in regard to the issue @Pcoombe found and reported in T368645.  Can we update this task's description...
[15:00:19] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Revert "temporarily add mx-in1001 as an MX server" [dns] - 10https://gerrit.wikimedia.org/r/1050630 (owner: 10JHathaway)
[15:00:38] <claime>	 !log homer 'cr*eqiad*' commit 'T351074'
[15:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:44] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[15:00:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:02:51] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[15:04:22] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[15:05:01] <wikibugs>	 (03PS1) 10Btullis: Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259)
[15:05:07] <wikibugs>	 (03PS1) 10Scott French: commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835)
[15:05:33] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:55] <jhathaway>	 !log mx-in1001 postfix mx testing complete
[15:06:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:00] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2025.codfw.wmnet with OS bullseye
[15:08:15] <wikibugs>	 (03CR) 10JHathaway: profile::puppetserver::git: add an option to exclude servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[15:08:33] <wikibugs>	 (03CR) 10Scott French: [C:03+2] commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:08:45] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[15:09:05] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[15:09:15] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:15] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:20] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2029.codfw.wmnet with OS bullseye
[15:09:35] <wikibugs>	 (03Merged) 10jenkins-bot: commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:10:09] <claime>	 !log Pooling and uncordoning wikikube-worker1027.eqiad.wmnet,wikikube-worker1028.eqiad.wmnet,wikikube-worker1029.eqiad.wmnet,wikikube-worker1030.eqiad.wmnet,wikikube-worker1031.eqiad.wmnet - T351074
[15:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:15] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[15:10:18] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1027.eqiad.wmnet|wikikube-worker1028.eqiad.wmnet|wikikube-worker1029.eqiad.wmnet|wikikube-worker1030.eqiad.wmnet|wikikube-worker1031.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[15:10:24] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2028.codfw.wmnet with OS bullseye
[15:11:03] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:11:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T368639#9934831 (10Clement_Goubert)
[15:11:20] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[15:11:56] <wikibugs>	 (03Merged) 10jenkins-bot: Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[15:12:13] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[15:12:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:14:17] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2027.codfw.wmnet with OS bullseye
[15:14:27] <hnowlan>	 !log homer 'cr*codfw*' commit 'T351074'
[15:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:21:38] <andrewbogott>	 !log upgraded wikitech-static to 1_42 and php 8.3
[15:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:47] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[15:22:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:23:03] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 734.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:23:24] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29660 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:25:49] <wikibugs>	 06SRE, 10Data-Services, 06DBA, 07Tracking-Neverending: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#9934899 (10JJMC89)
[15:25:52] <logmsgbot>	 !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2025.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet),cluster=kubernetes,service=kubesvc
[15:27:47] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[15:29:22] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368743 (10hnowlan) 03NEW
[15:31:38] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744 (10elukey) 03NEW
[15:31:56] <wikibugs>	 (03CR) 10DCausse: wdqs: enable throttling only for requests coming from the CDN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[15:32:03] <wikibugs>	 (03PS2) 10Elukey: Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744)
[15:32:37] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:34:27] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9934956 (10elukey) `docker-reporter-base-images.service` on build2001 reports an issue with the dec-puppet-client image:  ` [2024-06-28T04...
[15:34:52] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9934962 (10Ladsgroup) Let me fix the case of donate.m.
[15:35:05] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker2026 - cmooney@cumin1002"
[15:39:03] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:39:19] <wikibugs>	 (03PS1) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882)
[15:40:01] <wikibugs>	 (03CR) 10Ssingh: "Looks good but probably best if it goes down with the rest of the CNAMEs." [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup)
[15:41:32] <wikibugs>	 (03PS2) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882)
[15:42:24] <wikibugs>	 (03CR) 10Cathal Mooney: "Yeah it's a good point, I hadn't considered the cookbook as an alternate automation pipeline." [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[15:42:35] <wikibugs>	 (03CR) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[15:43:08] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker2026 - cmooney@cumin1002"
[15:43:08] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:43:18] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935007 (10greg) Thanks Amir!
[15:45:08] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye
[15:47:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 378301568 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:48:08] <wikibugs>	 (03PS3) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882)
[15:48:31] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup)
[15:48:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:49:51] <wikibugs>	 (03CR) 10Ladsgroup: "sure!" [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup)
[15:50:01] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup)
[15:51:32] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935064 (10Ladsgroup) reloading zones now.
[15:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:45] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on mw2300.codfw.wmnet with reason: Reimaging issues
[16:03:47] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on mw2300.codfw.wmnet with reason: Reimaging issues
[16:06:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:43] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935118 (10greg)
[16:11:14] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou)
[16:11:20] <wikibugs>	 (03PS1) 10Btullis: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259)
[16:11:44] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935144 (10Scott_French) Thanks so much, @SGupta-WMF.  Alright, so I think we'...
[16:12:10] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935154 (10Scott_French)
[16:12:33] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou)
[16:12:49] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[16:13:27] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou)
[16:14:33] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[16:16:34] <wikibugs>	 (03PS6) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[16:17:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, PCC looks clean!" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[16:17:29] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[16:17:53] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935199 (10elukey) On db1195 I see for `emacs-nox`:  ` MariaDB [debmonitor]> select * from bin_packages_package where name = 'emacs-nox'...
[16:18:22] <wikibugs>	 (03CR) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[16:19:59] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[16:23:07] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:25:44] <wikibugs>	 (03PS1) 10Btullis: cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259)
[16:26:30] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:28:51] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[16:29:36] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935247 (10elukey) Ok I see, I ran debmonitor inside the dcl image:  `     "os": "Debian 12",     "uninstalled": [],     "update_type": "f...
[16:30:16] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] deployment_server: Add a mwscript-k8s cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus)
[16:30:41] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9935249 (10fnegri) @bd808 @Ladsgroup thanks for your replies!  I will reiterate that the general goal is to make root access to clouddb* hosts as saf...
[16:31:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:31:48] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[16:33:32] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:34:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:46] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935262 (10xcollazo) https://gerrit.wikimedia.org/r/1049617, which was pointed...
[16:35:33] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:36:18] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:36:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:38:37] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9935304 (10Sharvaniharan) Hi @Dzahn @Ottomata  If it helps, here is the full error text I am getting on my dashboard: https://docs.google.com/document/d/1A5VF4mbhCQIWPHHbylIGLJq6EGirn1ZYHXsXDPuY...
[16:41:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935311 (10VRiley-WMF)
[16:42:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368732#9935309 (10VRiley-WMF) →14Duplicate dup:03T368648
[16:43:12] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2026.codfw.wmnet with OS bullseye
[16:49:58] <wikibugs>	 (03PS7) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[16:51:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935338 (10VRiley-WMF) 05Open→03Resolved
[16:51:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935337 (10VRiley-WMF) This has been corrected by adjusting the power cable
[17:03:03] <wikibugs>	 (03PS1) 10Btullis: cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259)
[17:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:04:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:09:00] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935383 (10Dzahn) As someone involved in disabling the dumps services and pupp...
[17:09:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:10:08] <wikibugs>	 (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462)
[17:34:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:44:15] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:45:33] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:51:39] <wikibugs>	 (03PS2) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462)
[17:59:40] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci)
[18:00:28] <wikibugs>	 (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci)
[18:00:51] <wikibugs>	 (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462)
[18:01:04] <wikibugs>	 (03PS1) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:04:11] <wikibugs>	 (03PS2) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462)
[18:05:16] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[18:05:33] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:05:34] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[18:08:06] <wikibugs>	 (03PS2) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:08:53] <wikibugs>	 (03PS3) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:09:44] <wikibugs>	 (03PS4) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:09:47] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:10:15] <wikibugs>	 (03CR) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:10:53] <wikibugs>	 (03CR) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:13:19] <wikibugs>	 (03PS2) 10Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322)
[18:13:25] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci)
[18:14:05] <wikibugs>	 (03PS1) 10RLazarus: deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661
[18:14:15] <wikibugs>	 (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci)
[18:16:19] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[18:16:36] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[18:16:44] <wikibugs>	 (03PS5) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:17:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus)
[18:18:09] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:18:48] <wikibugs>	 (03PS2) 10RLazarus: deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661
[18:18:56] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "wikimedia.org: Set CNAME record for donate.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1050662
[18:19:43] <wikibugs>	 (03Abandoned) 10Ladsgroup: Revert "wikimedia.org: Set CNAME record for donate.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1050662 (owner: 10Ladsgroup)
[18:19:48] <wikibugs>	 (03PS3) 10Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322)
[18:20:33] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:21:16] <wikibugs>	 (03PS6) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645)
[18:22:03] <sukhe>	 !log disable puppet on A:cp-text
[18:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:12] <wikibugs>	 (03PS2) 10Scott French: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423)
[18:22:23] <wikibugs>	 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766 (10phaultfinder) 03NEW
[18:24:22] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:25:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9935568 (10cmooney) I may have spoken too soon when I said things were working fine.  It seems in codfw since the change we are only get...
[18:27:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[18:29:10] <wikibugs>	 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935587 (10Dzahn)
[18:29:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:30:56] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368767 (10phaultfinder) 03NEW
[18:32:52] <wikibugs>	 (03CR) 10Scott French: "Thanks so much for the review! Apologies for losing track of this patch. I've rebased to get back up to date, and it looks like this shoul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[18:34:48] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935618 (10Ladsgroup) I would like to monitor the databases when the dumps sta...
[18:37:44] <wikibugs>	 (03CR) 10Scott French: [C:03+1] deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus)
[18:42:27] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus)
[18:43:08] <wikibugs>	 (03PS1) 10Ssingh: varnish: redirect donate.m.wikimedia.org temporarily after mobile_ [puppet] - 10https://gerrit.wikimedia.org/r/1050665
[18:43:32] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935636 (10xcollazo) Great, tentatively I've scheduled time with @BTullis on W...
[18:44:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: redirect donate.m.wikimedia.org temporarily after mobile_ [puppet] - 10https://gerrit.wikimedia.org/r/1050665 (owner: 10Ssingh)
[18:45:52] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935645 (10xcollazo) >>! In T368098#9935618, @Ladsgroup wrote: > I would like...
[18:54:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368099#9935664 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033
[18:54:39] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368564#9935682 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033
[18:56:13] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368767#9935733 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated psu
[18:58:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9935754 (10Jclark-ctr) a:03VRiley-WMF Did the mgmt IP address get updated for any maintenance you performed?
[19:02:08] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9935782 (10Jclark-ctr) a:03BTullis
[19:02:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9935780 (10VRiley-WMF) Not that I'm aware of. I used the same cable for everything. @Eevans would you happen to know if the IP address changed on this?
[19:09:15] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:17] <wikibugs>	 (03PS1) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669
[19:14:22] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (owner: 10Ssingh)
[19:15:22] <wikibugs>	 (03PS2) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645)
[19:17:47] <wikibugs>	 (03PS3) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645)
[19:19:00] <wikibugs>	 (03CR) 10BBlack: [C:03+1] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[19:19:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[19:30:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[19:31:32] <sukhe>	 !log sudo cumin -b10 "A:cp-text" "run-puppet-agent --enable 'dont enable'": T368645
[19:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:37] <logmsgbot>	 !log jclark@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[19:31:42] <stashbot>	 T368645: Google search results pointing to nonexistent https://donate.m.wikimedia.org/ - https://phabricator.wikimedia.org/T368645
[19:31:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[19:33:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1028
[19:34:29] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1028
[19:35:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1029
[19:36:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[19:36:12] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1029
[19:37:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[19:37:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:46:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy1028.eqiad.wmnet with OS bookworm
[19:46:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy1029.eqiad.wmnet with OS bookworm
[19:46:45] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935903 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbproxy1028.eqiad.wmnet with OS bookworm
[19:46:47] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbproxy1029.eqiad.wmnet with OS bookworm
[19:48:40] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935918 (10Jclark-ctr)
[19:51:18] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935933 (10Jclark-ctr) a:03Jclark-ctr
[19:54:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:57:07] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1029.eqiad.wmnet with reason: host reimage
[19:57:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage
[19:59:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:37] <mutante>	 ^ lists1001 is not in production and we tried to disable the monitoring before but it's back..
[20:00:57] <mutante>	 downtiming them
[20:01:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1029.eqiad.wmnet with reason: host reimage
[20:01:27] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on lists1001.wikimedia.org with reason: decomed
[20:01:39] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on lists1001.wikimedia.org with reason: decomed
[20:02:10] <mutante>	 no alerts during this shift, cya
[20:02:18] <sukhe>	 <3
[20:04:02] <wikibugs>	 (03PS1) 10Jdlrobson: Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524)
[20:04:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage
[20:06:51] <icinga-wm>	 RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:15:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:18:17] <wikibugs>	 (03PS1) 10Ssingh: varnish: selectively redirect donate.m.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/1050672 (https://phabricator.wikimedia.org/T368645)
[20:18:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:18:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1029.eqiad.wmnet with OS bookworm
[20:18:59] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbproxy1029.eqiad.wmnet with OS bookworm completed: - dbproxy1029 (**PA...
[20:19:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:20:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:20:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1028.eqiad.wmnet with OS bookworm
[20:20:41] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936088 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbproxy1028.eqiad.wmnet with OS bookworm completed: - dbproxy1028 (**PA...
[20:20:57] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936089 (10Jclark-ctr)
[20:21:04] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936092 (10Jclark-ctr) 05Open→03Resolved
[20:29:44] <sukhe>	 !log sudo cumin "A:cp-text" 'disable-puppet "CR 1050672"'
[20:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: selectively redirect donate.m.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/1050672 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[20:31:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:23] <wikibugs>	 (03PS18) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[20:38:32] <wikibugs>	 (03CR) 10Gergő Tisza: "PS 18: do not set $wgCentralAuthSsoUrlPrefix to false when on the shared domain to communicate that fact. It makes local testing more comp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[20:40:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:44:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:51:46] <wikibugs>	 (03PS2) 10Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151)
[20:54:04] <wikibugs>	 (03PS2) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[20:54:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[21:00:07] <wikibugs>	 (03PS3) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[21:00:15] <wikibugs>	 (03PS1) 10Ssingh: varnish: make trailing / optional for donate.m redirect [puppet] - 10https://gerrit.wikimedia.org/r/1050676
[21:00:40] <wikibugs>	 (03PS4) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[21:01:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: make trailing / optional for donate.m redirect [puppet] - 10https://gerrit.wikimedia.org/r/1050676 (owner: 10Ssingh)
[21:04:15] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:05:39] <sukhe>	 !log sudo cumin -b11 "A:cp-text" 'run-puppet-agent'
[21:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9936203 (10Andrew) a:05Andrew→03None
[21:11:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:14:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:15:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:16:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002"
[21:16:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy FORCED
[21:17:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002"
[21:17:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:17:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED
[21:18:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:18:25] <wikibugs>	 (PS2) Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524)
[21:18:25] <wikibugs>	 (PS3) Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795)
[21:18:26] <wikibugs>	 (PS3) Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150)
[21:19:18] <wikibugs>	 (CR) CI reject: [V:-1] [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: Jdlrobson)
[21:20:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002"
[21:21:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002"
[21:21:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:22:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy FORCED
[21:27:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936269 (Jclark-ctr) cloudcephosd1039 2nd cable serial#20220008 port 1 cloudcephosd1040 2nd cable serial#20220043 port 5 cloudcephosd1041 2nd cable seria...
[21:30:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED
[21:33:56] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:34:40] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy FORCED
[21:35:47] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:36:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:38:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:38:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:41:16] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1040 - jclark@cumin1002"
[21:41:32] <wikibugs>	 (PS1) Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949)
[21:41:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[21:41:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[21:41:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936298 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[21:41:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936299 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[21:42:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1040 - jclark@cumin1002"
[21:42:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:42:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[21:42:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936301 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[21:44:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:45:06] <wikibugs>	 (PS2) Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949)
[21:49:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:58:27] <wikibugs>	 SRE, DNS, fundraising-tech-ops, Traffic, Patch-For-Review: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9936334 (Dwisehaupt) Adding @AKanji-WMF on this to coordinate with Major Gifts for the benefactors site.  Anil: The previous tasks associated with this are: T10...
[22:03:37] <icinga-wm>	 PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 27825 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[22:09:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1040']
[22:10:07] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1040']
[22:13:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1041']
[22:14:02] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1041']
[22:15:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:16:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936398 (Jclark-ctr)
[22:17:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936403 (Jclark-ctr) a:Jclark-ctr
[22:24:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:07] <wikibugs>	 (CR) Ottomata: [C:+1] EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[22:36:57] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <wmf> for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9936422 (Ottomata) Yes, @Sharvaniharan will need analytics-privatedata-users access for that.   Approved!
[22:50:35] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet
[22:52:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936434 (Papaul)
[23:09:15] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:16:33] <tzatziki>	 !log removing 1 image for legal compliance
[23:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:09] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet
[23:18:17] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet
[23:21:51] <tzatziki>	 !log removing 1 image for legal compliance
[23:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:39] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet
[23:23:50] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet
[23:28:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:29:25] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet
[23:31:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip address for cloudcephosd1039 - pt1979@cumin2002"
[23:32:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip address for cloudcephosd1039 - pt1979@cumin2002"
[23:32:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:32:45] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet
[23:33:58] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet
[23:34:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:38:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050680
[23:38:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050680 (owner: TrainBranchBot)
[23:42:37] <tzatziki>	 !log removing 1 image for legal compliance
[23:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:50:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[23:54:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[23:55:12] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[23:55:12] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:56:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[23:57:54] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:59:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable