[00:01:48] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050483 (owner: TrainBranchBot) [00:05:27] (PS8) Jdlrobson: Enable action edit/submit and remaining special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366524) [00:06:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [00:06:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:06:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [00:08:23] (CR) Ssingh: "codfw and drmrs are not single-backend yet and have just one NVMe drive so we cannot unify all the configs just yet, sadly." [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: BCornwall) [00:08:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [00:10:45] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:29:16] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:29:43] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [00:33:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:33:02] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [00:33:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:33:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow: apply [00:36:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:36:41] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [00:40:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS bullseye [00:40:54] ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933201 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye compl... [00:45:07] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet [00:46:05] ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933202 (BCornwall) [00:51:01] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [00:51:34] (PS4) BCornwall: hiera: Unify all trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) [00:52:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [01:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:25:33] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:29:15] RESOLVED: [2x] 
ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:37:16] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [01:37:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [01:39:30] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [01:39:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [01:40:45] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 31.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:44:46] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [01:44:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [02:00:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:00:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [02:00:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:00:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [02:01:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:04:28] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:05:33] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:14:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [02:16:29] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:19:45] FIRING: Device rebooted: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [02:20:59] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368697 (10phaultfinder) 03NEW [02:23:12] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:23:17] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [02:24:45] RESOLVED: Device rebooted: Device ps1-a8-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [02:27:54] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:30:22] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [02:30:41] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has 
failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:54:37] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:54:37] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:55:07] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:55:09] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:03:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9933337 (10Papaul) [05:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:31:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:15] (PS3) Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0600) [06:01:20] (PS2) Ayounsi: Cookbooks: fix Netbox 4 breaking changes [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) [06:02:37] (CR) Ayounsi: "I only tested `sre.network.debug` but seeing how small the changes are, after proper review we can fix any remaining bugs once deployed to" [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:30:47] PROBLEM - MariaDB 
Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:52:49] (CR) Vgutierrez: [C:-1] "all domains looking good from here but `wikimedia.ro`:" [dns] - https://gerrit.wikimedia.org/r/1050484 (owner: Ncmonitor) [06:58:47] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 16.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0700) [07:09:50] SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9933422 (SGupta-WMF) Thank you @Scott_French and @mforns . I re-ran the pipe... 
[07:13:54] SRE, collaboration-services, LDAP-Access-Requests, Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9933424 (SLyngshede-WMF) @Dzahn It's already on my todo :-) [07:19:33] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295) (owner: Dzahn) [07:54:22] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4 [07:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:39] RECOVERY - Disk space on backup2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops [08:12:19] (CR) Jelto: [C:+2] Revert "aptrepo: revert gitlab-ce version to 16.11" [puppet] - https://gerrit.wikimedia.org/r/1050314 (https://phabricator.wikimedia.org/T365675) (owner: Jelto) [08:12:26] (CR) Elukey: Tox: add python 3.12 support (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/1050262 (owner: Ayounsi) [08:15:47] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:22:05] (CR) Elukey: "Left some comments mostly to better understand the changes, but looks good! 
I assume that this patch will be merged only when we upgrade n" [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:29:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [08:29:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [08:29:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65543 and previous config saved to /var/cache/conftool/dbconfig/20240628-082946-marostegui.json [08:29:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:30:55] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9933522 (elukey) Current status: * We are following up with Supermicro to customize the default root password for... 
[08:33:18] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9933530 (cmooney) [08:34:35] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:37:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE)) [08:41:00] (PS1) Btullis: Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) [08:42:47] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:46:51] (PS1) Slavina Stefanova: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 [08:55:31] (CR) Ayounsi: "Overall lgtm, some inline comments." 
[homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [09:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:11:02] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9933562 (elukey) [09:13:22] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9933563 (cmooney) Open→Resolved Thanks all for the help with this one! [09:14:07] (CR) Brouberol: [C:+1] "The tag as a timestamp is really nice! Thanks @bking@wikimedia.org" [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [09:14:33] (CR) Btullis: [C:+2] Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [09:17:32] (Merged) jenkins-bot: Update the image used for the ceph-csi containers [deployment-charts] - https://gerrit.wikimedia.org/r/1050566 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [09:19:21] (PS5) Superpes15: [pswiki] Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851) [09:22:53] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:25:01] (CR) JMeybohm: [C:+1] deployment_server: Add a mwscript-k8s cleanup script [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [09:28:19] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway) [09:31:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:05] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:33:18] (PS1) Elukey: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) [09:33:22] (PS1) Elukey: admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) [09:33:26] (PS1) Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) [09:33:30] (PS1) Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) [09:37:26] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:37:47] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:49:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [09:51:21] (PS1) Hnowlan: thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) [09:51:31] (CR) Ayounsi: Homer: fix Netbox 4 breaking changes (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:52:13] (PS2) Ayounsi: Tox: add python 3.12 support [software/homer] - https://gerrit.wikimedia.org/r/1050262 [09:52:13] (PS4) Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) [09:52:26] (CR) Ayounsi: Tox: add python 3.12 support (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/1050262 (owner: Ayounsi) [09:57:47] !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-serve2007.codfw.wmnet [09:57:48] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve2007.codfw.wmnet [10:05:03] (PS1) Btullis: Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) [10:05:36] (CR) Brouberol: [C:+1] Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [10:08:09] (CR) Btullis: [C:+2] Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [10:08:32] (CR) Clément Goubert: [C:+1] thumbor: update 3d2png path 
[deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan) [10:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:16] (CR) Hnowlan: [C:+2] thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan) [10:09:25] (CR) Ayounsi: "First (quick) pass." [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [10:11:18] (Merged) jenkins-bot: Update the ceph-csi image to add missing libraries [deployment-charts] - https://gerrit.wikimedia.org/r/1050575 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [10:12:21] (Merged) jenkins-bot: thumbor: update 3d2png path [deployment-charts] - https://gerrit.wikimedia.org/r/1050573 (https://phabricator.wikimedia.org/T368301) (owner: Hnowlan) [10:12:24] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:16:54] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:17:01] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:17:17] (CR) Klausman: [C:+1] "LGTM, but I defer to Hugh on the final yea/nay." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [10:17:44] (CR) Klausman: [C:+1] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [10:17:58] (CR) Klausman: [C:+1] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [10:18:33] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:21:31] (CR) Alexandros Kosiaris: [C:+2] Add deploy1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1050345 (https://phabricator.wikimedia.org/T364416) (owner: Clément Goubert) [10:22:24] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:22:29] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:22:34] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:30:17] (PS1) Clément Goubert: Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074) [10:30:46] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:33:45] (CR) Hnowlan: [C:+1] Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [10:34:20] (CR) Cathal Mooney: [C:+2] Update gnmic config to allow processing of all interface stats [puppet] - https://gerrit.wikimedia.org/r/1049242 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney) [10:34:34] (CR) Hnowlan: [C:+1] "lgtm, thanks!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [10:34:57] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9933846 (MoritzMuehlenhoff) Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point. [10:35:28] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:37:33] (CR) Clément Goubert: [C:+2] Move 5 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1050579 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [10:40:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1412 to wikikube-worker1027 [10:40:23] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:42:38] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1412 to wikikube-worker1027 - cgoubert@cumin1002" [10:43:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1412 to wikikube-worker1027 - cgoubert@cumin1002" [10:43:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:43:54] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1027 [10:43:55] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:44:02] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:44:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1027 [10:45:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename 
(exit_code=0) from mw1412 to wikikube-worker1027 [10:45:26] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.eqiad.wmnet on all recursors [10:45:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.eqiad.wmnet on all recursors [10:45:40] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:45:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [10:46:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1413 to wikikube-worker1028 [10:46:06] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:48:32] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1413 to wikikube-worker1028 - cgoubert@cumin1002" [10:49:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1413 to wikikube-worker1028 - cgoubert@cumin1002" [10:49:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:49:41] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 [10:50:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 [10:50:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1413 to wikikube-worker1028 [10:50:53] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[10:51:13] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.eqiad.wmnet on all recursors [10:51:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.eqiad.wmnet on all recursors [10:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye [10:51:39] !log Running `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` [10:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1417 to wikikube-worker1029 [10:51:58] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:54:09] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1417 to wikikube-worker1029 - cgoubert@cumin1002" [10:54:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [10:56:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1417 to wikikube-worker1029 - cgoubert@cumin1002" [10:56:22] !log 
cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:22] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1029 [10:56:55] !log Stopped running script at `cawiki` [10:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1029 [10:58:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1417 to wikikube-worker1029 [10:58:30] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye [10:58:37] !log Running `foreachwikiindblist group1-wikipedia.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` [10:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:07] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1029.eqiad.wmnet on all recursors [10:59:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1029.eqiad.wmnet on all recursors [10:59:57] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240628T1100). 
[11:00:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1418 to wikikube-worker1030 [11:00:32] (PS1) Btullis: Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) [11:00:32] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:00:57] (CR) Brouberol: [C:+1] Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [11:01:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:02:52] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1418 to wikikube-worker1030 - cgoubert@cumin1002" [11:04:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1418 to wikikube-worker1030 - cgoubert@cumin1002" [11:04:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:04:13] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1030 [11:04:14] (CR) Btullis: [C:+2] Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [11:05:41] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3498 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:06:05] ^ should resolve soon [11:06:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version 
[11:06:43] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 108890 bytes in 1.031 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:06:52] (CR) Ladsgroup: [C:+1] bashrc: adds alias for ripgrep [puppet] - https://gerrit.wikimedia.org/r/1050398 (owner: Arnaudb) [11:06:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:07:01] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:07:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:08:05] (Merged) jenkins-bot: Set the fsGroup to 900 for the ceph-csi provisioner [deployment-charts] - https://gerrit.wikimedia.org/r/1050585 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [11:08:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1030 [11:08:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1418 to wikikube-worker1030 [11:09:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1450 to wikikube-worker1031 [11:09:25] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:10:33] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:12] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1030.eqiad.wmnet on all 
recursors [11:11:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1030.eqiad.wmnet on all recursors [11:11:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1030.eqiad.wmnet with OS bullseye [11:11:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:57] !log `foreachwikiindblist group1-wikipedia.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` finished running [11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:56] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) [11:13:16] !log Running `foreachwikiindblist medium.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` for T366781. `medium.dblist` does not include `loginwiki` or `metawiki` (which are to be done later). 
[11:13:21] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 25s) [11:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:21] T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781 [11:14:15] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:42] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1450 to wikikube-worker1031 - cgoubert@cumin1002" [11:15:29] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:15:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:16:04] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:16:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1450 to wikikube-worker1031 - cgoubert@cumin1002" [11:16:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:06] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1031 [11:17:05] (PS2) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1046678 [11:17:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1031 [11:17:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1450 to wikikube-worker1031 [11:18:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1031.eqiad.wmnet on all recursors [11:18:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1031.eqiad.wmnet on all recursors [11:18:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1031.eqiad.wmnet with OS bullseye [11:23:57] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:24:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:25:50] (CR) Jforrester: [C:+2] Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE) [11:25:59] (CR) Jforrester: [C:+1] "Bah." [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE) [11:26:16] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:26:29] (CR) Ladsgroup: [C:+1] "I'm planning to deploy this on Monday. Sorry for missing this." [mediawiki-config] - https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: VolkerE) [11:28:05] (CR) Hnowlan: [C:+1] mcrouter: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [11:28:12] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/991452 (owner: PipelineBot) [11:28:16] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992662 (owner: PipelineBot) [11:28:20] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/995362 (owner: PipelineBot) [11:28:23] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1006875 (owner: PipelineBot) [11:28:27] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1016384 (owner: PipelineBot) [11:28:31] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1019766 (owner: PipelineBot) [11:28:34] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1022033 (owner: PipelineBot) [11:28:38] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1030557 (owner: PipelineBot) [11:28:41] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1031594 (owner: PipelineBot) [11:28:45] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1047465 (owner: 
PipelineBot) [11:28:55] (PS2) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1049506 [11:29:13] (PS1) Btullis: Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) [11:29:14] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) [11:29:36] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:29:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:29:59] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 44s) [11:30:25] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:30:59] (CR) Phuedx: [C:+1] Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis) [11:31:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:31:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage [11:35:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage [11:37:47] (CR) Btullis: [C:+2] Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis) [11:38:23] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host wikikube-worker1030.eqiad.wmnet with OS bullseye [11:38:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1030.eqiad.wmnet with OS bullseye [11:38:38] (Merged) jenkins-bot: Increase the eventgate canary log_level to trace, temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1050588 (https://phabricator.wikimedia.org/T368495) (owner: Btullis) [11:41:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590 [11:41:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590 (owner: TrainBranchBot) [11:44:04] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:44:07] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:44:11] (CR) Hnowlan: [C:+1] "lgtm!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: Scott French) [11:44:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:44:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage [11:45:15] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:45:29] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:45:40] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:46:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:47:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage [11:49:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:50:43] !log Finished run on `medium.dblist` [11:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage [11:54:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1031.eqiad.wmnet with OS bullseye [11:55:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage [11:57:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:59:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:18] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:05:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:06:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1028.eqiad.wmnet with OS bullseye [12:08:18] PROBLEM - Host wikikube-worker1028 is DOWN: PING CRITICAL - Packet loss = 100% [12:10:46] RECOVERY - Host wikikube-worker1028 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:13:57] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker1030.eqiad.wmnet with OS bullseye [12:14:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65544 and previous config saved to /var/cache/conftool/dbconfig/20240628-121404-marostegui.json [12:14:11] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:15:13] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:17:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for an-conf1005,6 - jclark@cumin1002" [12:18:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for an-conf1005,6 - jclark@cumin1002" [12:18:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:21:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1004 [12:21:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050590 (owner: TrainBranchBot) [12:23:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1004 [12:23:24] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1006 [12:24:37] (PS1) Hashar: Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029) [12:24:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1006 [12:29:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to 
https://phabricator.wikimedia.org/P65545 and previous config saved to /var/cache/conftool/dbconfig/20240628-122911-marostegui.json [12:32:28] (PS1) TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) [12:34:20] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9934200 (cmooney) @fgiunchedi I was perhaps a little cheeky and merged this, but it was clear the volume of new metrics was well withi... [12:35:25] (CR) Elukey: [V:+2 C:+2] "Done" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [12:35:26] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1029.eqiad.wmnet with OS bullseye [12:35:58] (PS1) Alexandros Kosiaris: deploy1003: Switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1050597 (https://phabricator.wikimedia.org/T364416) [12:37:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bullseye [12:37:36] (PS1) Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) [12:39:08] (CR) Hashar: [C:+2] Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029) (owner: Hashar) [12:39:38] (Merged) jenkins-bot: Update Gerrit 3.10 snapshot [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050595 (https://phabricator.wikimedia.org/T367029) (owner: Hashar) [12:40:19] (CR) Elukey: Homer: fix Netbox 4 breaking changes (2 comments) [software/homer] - 
https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [12:43:56] (CR) Arnaudb: [C:+2] bashrc: adds alias for ripgrep [puppet] - https://gerrit.wikimedia.org/r/1050398 (owner: Arnaudb) [12:44:12] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1004 [12:44:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:44:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1004 [12:44:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65546 and previous config saved to /var/cache/conftool/dbconfig/20240628-124419-marostegui.json [12:44:40] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:45:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:45:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:48:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [12:48:18] PROBLEM - Host wikikube-worker1028 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [12:50:06] !log cgoubert@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage [12:50:46] RECOVERY - Host wikikube-worker1028 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [12:51:54] ops-eqiad, SRE, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9934251 (Jclark-ctr) [12:51:58] SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9934253 (mforns) Yay! Thanks @SGupta-WMF [12:52:04] (PS1) Elukey: TESTING ONLY - profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) [12:53:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage [12:53:15] I am going to upgrade Gerrit to apply some patches for regressions we have discovered over the week [12:53:22] (CR) Alexandros Kosiaris: [C:+2] deploy1003: Switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1050597 (https://phabricator.wikimedia.org/T364416) (owner: Alexandros Kosiaris) [12:53:23] that will be a short downtime [12:53:33] (PS2) Elukey: TESTING ONLY - profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) [12:54:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:52] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [12:55:19] !log hashar@deploy1002 Started deploy [gerrit/gerrit@0db053e]: Upgrade Gerrit 3.10.0-32-gf77960412e to 3.10.0-71-gf6e9431fff - T367029 T341291 [12:55:26] T367029: "Press c to comment" is placed incorrectly when using Firefox 126 and 128 on macOS - https://phabricator.wikimedia.org/T367029 [12:55:26] T341291: Install gerrit image-diff plugin - https://phabricator.wikimedia.org/T341291 [12:55:28] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@0db053e]: Upgrade Gerrit 3.10.0-32-gf77960412e to 3.10.0-71-gf6e9431fff - T367029 T341291 (duration: 00m 09s) [12:55:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:57:53] (PS3) Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) [12:59:02] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3108/" [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [12:59:02] I am stopping Gerrit NOW [12:59:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367856)', diff saved to https://phabricator.wikimedia.org/P65547 and previous config saved to /var/cache/conftool/dbconfig/20240628-125926-marostegui.json [12:59:35] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:01:08] !log akosiaris@cumin1002 START - Cookbook 
sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bookworm [13:03:08] (PS25) DCausse: wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [13:03:08] (PS5) DCausse: wdqs: enable throttling only for requests coming from varnish [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [13:03:52] (CR) CI reject: [V:-1] wdqs: enable throttling only for requests coming from varnish [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [13:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:05:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: mgmt ip issue [13:06:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: mgmt ip issue [13:09:44] (CR) Muehlenhoff: [C:+1] "One nit, looks good" [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [13:10:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:11:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1029.eqiad.wmnet with OS bullseye [13:11:14] (CR) Elukey: [V:+1] profile::puppetserver::git: add an option to exclude servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1050601 
(https://phabricator.wikimedia.org/T368023) (owner: Elukey) [13:11:31] (PS4) Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) [13:11:43] (CR) Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [13:12:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [13:12:38] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3109/" [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [13:13:26] (PS26) DCausse: wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [13:13:27] (PS6) DCausse: wdqs: enable throttling only for requests coming from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [13:14:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [13:16:38] (CR) CI reject: [V:-1] wdqs: enable throttling only for requests coming from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [13:17:45] (CR) JHathaway: [C:+2] postfix: add gitlab to recipient discards [puppet] - 
https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway) [13:18:24] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.mgmt.eqiad.wmnet on all recursors [13:18:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.mgmt.eqiad.wmnet on all recursors [13:18:54] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.mgmt.eqiad.wmnet on all recursors [13:18:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.mgmt.eqiad.wmnet on all recursors [13:19:55] (PS1) Elukey: role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) [13:20:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:21:02] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:21:18] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3110/" [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: Elukey) [13:22:12] (CR) Ssingh: [C:+1] "Looks good! We should definitely run PCC on one each of text/upload in all sites, just to be extra sure before rolling this out. 
Or maybe " [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: BCornwall)
[13:24:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:29] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker102[7-8] - cmooney@cumin1002"
[13:25:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker102[7-8] - cmooney@cumin1002"
[13:25:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:26:53] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[13:27:22] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:27:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1027.mgmt.eqiad.wmnet on all recursors
[13:27:31] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:27:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1028.mgmt.eqiad.wmnet on all recursors
[13:28:22] !log running `decommission` on 5 codfw api appservers
[13:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:15] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002"
[13:29:59] !log cgoubert@cumin1002 START -
Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[13:31:22] (PS2) Dzahn: admin: add jsn to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295)
[13:32:06] (PS2) Elukey: role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:32:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host krb1002.eqiad.wmnet with OS bookworm
[13:33:00] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host krb1002.eqiad.wmnet with OS bookworm
[13:33:14] (PS1) Btullis: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - https://gerrit.wikimedia.org/r/1050608
[13:33:21] (CR) Dzahn: [C:+2] admin: add jsn to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295) (owner: Dzahn)
[13:33:23] (PS2) Btullis: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495)
[13:34:54] (CR) Btullis: [C:+2] Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495) (owner: Btullis)
[13:34:58] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 379.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:35:22] (PS3) Elukey: role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:35:49] (Merged) jenkins-bot: Revert "Increase the eventgate canary log_level to trace, temporarily" [deployment-charts] - https://gerrit.wikimedia.org/r/1050608 (https://phabricator.wikimedia.org/T368495) (owner: Btullis)
[13:35:59] (CR) JHathaway: [C:+1] "one suggestion, otherwise looks good." [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:36:54] (PS1) Hnowlan: kubernetes: move 5 codfw api appservers to k8s workers [puppet] - https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074)
[13:37:02] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9934379 (Dzahn) In progress→Resolved @jsn.sherman You have now been added to the group as requested. Feel free...
[13:37:50] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[13:38:04] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:38:17] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[13:38:27] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:38:31] (PS1) DDesouza: miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471)
[13:38:53] (CR) CI reject: [V:-1] role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:39:33] (CR) Clément Goubert: [C:+1] kubernetes: move 5 codfw api appservers to k8s workers [puppet] - https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[13:39:39] (CR) DDesouza: [C:+2]
miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[13:39:50] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368697#9934399 (Dzahn)
[13:39:56] (CR) Hnowlan: [C:+2] kubernetes: move 5 codfw api appservers to k8s workers [puppet] - https://gerrit.wikimedia.org/r/1050609 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[13:40:06] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368697#9934403 (Dzahn) →Duplicate dup:T368648
[13:40:18] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9934401 (Dzahn)
[13:40:53] (Merged) jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1050610 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[13:41:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002"
[13:41:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bookworm
[13:41:36] (PS4) Elukey: role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:41:46] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:42:03] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:42:04] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[13:42:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye
[13:42:36] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[13:42:37]
!log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[13:42:45] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:42:56] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[13:43:18] (PS2) Slavina Stefanova: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567
[13:43:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage
[13:43:58] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:44:19] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9934431 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[13:44:42] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9934434 (Dzahn) @Ottomata Sharvaniharan requested access to superset and the "wmf" group here but I noticed she already has that. Based on the description "Currently dashboard is not loading...
[13:44:54] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9934439 (Dzahn) a:Dzahn→None
[13:45:09] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9934442 (Dzahn) In progress→Open
[13:45:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on krb1002.eqiad.wmnet with reason: host reimage
[13:45:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2298 to wikikube-worker2025
[13:46:08] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9934454 (Dzahn)
[13:46:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage
[13:46:12] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[13:46:20] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2300 to wikikube-worker2026
[13:46:38] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2300 to wikikube-worker2026
[13:47:01] (CR) CI reject: [V:-1] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (owner: Slavina Stefanova)
[13:49:02] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2298 to wikikube-worker2025 - hnowlan@cumin1002"
[13:49:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb1002.eqiad.wmnet with reason: host reimage
[13:50:13] (PS1) Hashar: gerrit: enable "new" image diff UI [puppet] - https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291)
[13:53:19] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:54:00] !log hnowlan@cumin1002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2298 to wikikube-worker2025 - hnowlan@cumin1002"
[13:54:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:54:00] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2025
[13:54:24] (PS5) Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[13:54:24] (PS5) Elukey: role::puppetserver: skip puppet-merge [puppet] - https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023)
[13:54:25] (PS1) Btullis: ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259)
[13:54:30] (CR) Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (4 comments) [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[13:54:34] (CR) Elukey: profile::puppetserver::git: add an option to exclude servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:56:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage
[13:56:57] (CR) DCausse: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[13:57:23] (CR) Brouberol: [C:+1] ceph-csi: revert fsGroup change and disable metrics container (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[13:57:27] (CR) Hashar: "I think it is fine to enable at
anytime though I am pretty sure the Gerrit daemon requires to be restarted to apply the change." [puppet] - https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291) (owner: Hashar)
[13:58:15] (CR) JHathaway: [C:+2] temporarily add mx-in1001 as an MX server [dns] - https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) (owner: JHathaway)
[13:58:48] (CR) Btullis: [C:+2] ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[13:59:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2025
[13:59:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2298 to wikikube-worker2025
[14:00:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2306 to wikikube-worker2027
[14:00:43] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:01:31] !log ingressing email on mx-in1001, initial test 1hr
[14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:47] (Merged) jenkins-bot: ceph-csi: revert fsGroup change and disable metrics container [deployment-charts] - https://gerrit.wikimedia.org/r/1050615 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:03:14] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2306 to wikikube-worker2027 - hnowlan@cumin1002"
[14:04:15] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:04:40] !log btullis@deploy1002 helmfile
[dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:05:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1027.eqiad.wmnet with OS bullseye
[14:05:38] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mx-in1001.wikimedia.org with reason: email testing
[14:06:02] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in1001.wikimedia.org with reason: email testing
[14:06:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2306 to wikikube-worker2027 - hnowlan@cumin1002"
[14:06:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:06:58] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2027
[14:06:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:07:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2027
[14:07:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2306 to wikikube-worker2027
[14:07:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2308 to wikikube-worker2028
[14:08:04] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:09:01] (PS1) Kosta Harlan: IPReputation: Enable extension on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1050619 (https://phabricator.wikimedia.org/T360067)
[14:10:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bullseye
[14:12:02] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: Renaming mw2308 to wikikube-worker2028 - hnowlan@cumin1002"
[14:14:15] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:36] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:21:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2308 to wikikube-worker2028 - hnowlan@cumin1002"
[14:21:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:22:00] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2028
[14:22:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2028
[14:22:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2308 to wikikube-worker2028
[14:22:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2330 to wikikube-worker2029
[14:22:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:22:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host krb1002.eqiad.wmnet with OS bookworm
[14:22:43] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[14:22:48] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934580 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host krb1002.eqiad.wmnet with OS bookworm completed: - krb1002 (**WARN**)...
[14:23:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[14:23:48] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[14:23:51] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934586 (Jclark-ctr)
[14:24:12] (PS1) Ssingh: durum: remove redundant anycast_peers [puppet] - https://gerrit.wikimedia.org/r/1050620
[14:24:30] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9934589 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[14:24:47] (PS2) Ssingh: durum: remove redundant override for profile::systemd::timesyncd::ntp_servers [puppet] - https://gerrit.wikimedia.org/r/1050620
[14:24:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[14:25:16] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2330 to wikikube-worker2029 - hnowlan@cumin1002"
[14:25:20] (CR) Ssingh: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3114/console" [puppet] - https://gerrit.wikimedia.org/r/1050620 (owner: Ssingh)
[14:25:24] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[14:26:15] (CR) Ssingh: [V:+1 C:+2] durum: remove redundant override for profile::systemd::timesyncd::ntp_servers [puppet] - https://gerrit.wikimedia.org/r/1050620 (owner: Ssingh)
[14:26:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2330 to wikikube-worker2029 - hnowlan@cumin1002"
[14:26:48] !log hnowlan@cumin1002 END (PASS) -
Cookbook sre.dns.netbox (exit_code=0)
[14:26:48] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2029
[14:27:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2029
[14:27:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2330 to wikikube-worker2029
[14:27:27] !log sudo cumin "O:durum" "run-puppet-agent"
[14:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:15] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2025.codfw.wmnet with OS bullseye
[14:28:50] (PS1) Btullis: cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259)
[14:29:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2027.codfw.wmnet with OS bullseye
[14:30:00] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2028.codfw.wmnet with OS bullseye
[14:30:28] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2029.codfw.wmnet with OS bullseye
[14:30:46] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368732 (phaultfinder) NEW
[14:33:47] (CR) Btullis: [C:+2] cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:33:55] (CR) Alexandros Kosiaris: [C:+1] trafficserver: Final mw-on-k8s cleanup [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[14:34:11] (CR) Alexandros Kosiaris: [C:+1] trafficserver: Cleanup mw-on-k8s scripts [puppet] - https://gerrit.wikimedia.org/r/1049507
(https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[14:34:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:34:49] (CR) Alexandros Kosiaris: [C:+1] trafficserver::lua_script: Implement ensure param [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[14:35:33] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:43] (Merged) jenkins-bot: cephcsi: disable the metrics container in the nodeplugin [deployment-charts] - https://gerrit.wikimedia.org/r/1050622 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:39:15] FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:34] (PS5) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[14:41:13] (PS1) Btullis: Configure fsgroup for the cephcsi nodeplugin pod to be 900 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:41:46] (PS59) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[14:41:50] SRE, Cloud-Services,
DBA, Tracking-Neverending: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#9934676 (sguebo_WMF) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikime...
[14:41:57] (CR) CI reject: [V:-1] Configure fsgroup for the cephcsi nodeplugin pod to be 900 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:42:01] (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[14:42:38] (CR) Scott French: "Thanks, Hugh!" [deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[14:42:42] (CR) Scott French: [C:+2] services: add commons-impact-analytics service helmfile configs [deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[14:43:34] (Merged) jenkins-bot: services: add commons-impact-analytics service helmfile configs [deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[14:44:19] (PS1) Ssingh: hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - https://gerrit.wikimedia.org/r/1050626
[14:45:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage
[14:45:21] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1050626 (owner: Ssingh)
[14:45:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on
wikikube-worker2027.codfw.wmnet with reason: host reimage
[14:46:32] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:46:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:47:55] (CR) Cathal Mooney: "Thanks! Updated based on some of the feedback. I'll delve into the get_link_data function to try and address the concerns about using th" [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[14:47:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage
[14:48:15] (CR) BCornwall: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3116/console" [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: BCornwall)
[14:49:42] (PS1) Alexandros Kosiaris: deploy1003: Assign role [puppet] - https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416)
[14:50:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:51:03] (PS2) Btullis: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:33] (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 4):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3118/console" [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: BCornwall)
[14:51:36] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9934720 (elukey) The safest bet is to use `python3-confluent-kafka` in my opinion, it is pa...
[14:51:38] (PS3) Btullis: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259)
[14:51:42] (CR) CI reject: [V:-1] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:52:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:53:37] (CR) Btullis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:54:18] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[14:55:11] (CR) Brouberol: [C:+1] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:55:15] (CR) Btullis: [C:+2] Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:55:21] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500
MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1659 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:56:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage
[14:56:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:19] (Merged) jenkins-bot: Configure the user of the csi-rbdplugin container to be 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1050625 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:58:31] (PS1) JHathaway: Revert "temporarily add mx-in1001 as an MX server" [dns] - https://gerrit.wikimedia.org/r/1050630
[14:58:48] (PS3) Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850)
[14:58:55] (CR) Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[14:58:57] (CR) CDanis: [C:+1] Revert "temporarily add mx-in1001 as an MX server" [dns] - https://gerrit.wikimedia.org/r/1050630 (owner: JHathaway)
[14:59:15] FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:18] (PS1) AikoChou: ml-services: enable ALLOW_REVISION_JSON_INPUT
for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) [15:00:19] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9934769 (10spatton) Hey teammates and cc @Dzahn as you recommended us posting here in regard to the issue @Pcoombe found and reported in T368645. Can we update this task's description... [15:00:19] (03CR) 10JHathaway: [C:03+2] Revert "temporarily add mx-in1001 as an MX server" [dns] - 10https://gerrit.wikimedia.org/r/1050630 (owner: 10JHathaway) [15:00:38] !log homer 'cr*eqiad*' commit 'T351074' [15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:00:48] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:02:51] (03CR) 10BCornwall: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [15:04:22] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:05:01] (03PS1) 10Btullis: Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259) [15:05:07] (03PS1) 10Scott French: commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) [15:05:33] FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:55] !log mx-in1001 postfix mx testing complete [15:06:56] (03CR) 
10Hnowlan: [C:03+1] commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2025.codfw.wmnet with OS bullseye [15:08:15] (03CR) 10JHathaway: profile::puppetserver::git: add an option to exclude servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:08:33] (03CR) 10Scott French: [C:03+2] commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:08:45] (03CR) 10Btullis: [C:03+2] Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:09:05] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:09:15] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:15] FIRING: [5x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:20] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2029.codfw.wmnet with OS bullseye [15:09:35] (03Merged) 
10jenkins-bot: commons-impact-analytics: correct binary name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050634 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:10:09] !log Pooling and uncordoning wikikube-worker1027.eqiad.wmnet,wikikube-worker1028.eqiad.wmnet,wikikube-worker1029.eqiad.wmnet,wikikube-worker1030.eqiad.wmnet,wikikube-worker1031.eqiad.wmnet - T351074 [15:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:15] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:10:18] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1027.eqiad.wmnet|wikikube-worker1028.eqiad.wmnet|wikikube-worker1029.eqiad.wmnet|wikikube-worker1030.eqiad.wmnet|wikikube-worker1031.eqiad.wmnet),cluster=kubernetes,service=kubesvc [15:10:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2028.codfw.wmnet with OS bullseye [15:11:03] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:11:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T368639#9934831 (10Clement_Goubert) [15:11:20] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:11:56] (03Merged) 10jenkins-bot: Fix the values.yaml file for the cephcsi deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050633 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:12:13] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [15:12:30] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:14:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2027.codfw.wmnet with OS bullseye [15:14:27] !log homer 'cr*codfw*' commit 'T351074' [15:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:38] !log upgraded wikitech-static to 1_42 and php 8.3 [15:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:47] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:22:40] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:23:03] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 734.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:23:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29660 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [15:25:49] 06SRE, 10Data-Services, 06DBA, 07Tracking-Neverending: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#9934899 (10JJMC89) [15:25:52] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2025.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet),cluster=kubernetes,service=kubesvc [15:27:47] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:29:22] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368743 (10hnowlan) 03NEW [15:31:38] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744 (10elukey) 03NEW [15:31:56] (03CR) 10DCausse: wdqs: enable throttling only for requests coming from the CDN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:32:03] (03PS2) 10Elukey: Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) [15:32:37] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:34:27] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian 
version-id in the OS field - https://phabricator.wikimedia.org/T368744#9934956 (10elukey) `docker-reporter-base-images.service` on build2001 reports an issue with the dec-puppet-client image: ` [2024-06-28T04... [15:34:52] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9934962 (10Ladsgroup) Let me fix the case of donate.m. [15:35:05] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker2026 - cmooney@cumin1002" [15:39:03] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:39:19] (03PS1) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) [15:40:01] (03CR) 10Ssingh: "Looks good but probably best if it goes down with the rest of the CNAMEs." [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [15:41:32] (03PS2) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) [15:42:24] (03CR) 10Cathal Mooney: "Yeah it's a good point, I hadn't considered the cookbook as an alternate automation pipeline." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:42:35] (03CR) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:43:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for wikikube-worker2026 - cmooney@cumin1002" [15:43:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:18] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935007 (10greg) Thanks Amir! [15:45:08] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [15:47:35] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 378301568 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:48:08] (03PS3) 10Ladsgroup: wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) [15:48:31] (03CR) 10Ssingh: [C:03+1] wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [15:48:35] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:49:51] (03CR) 10Ladsgroup: "sure!" 
[dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [15:50:01] (03CR) 10Ladsgroup: [C:03+2] wikimedia.org: Set CNAME record for donate.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050641 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [15:51:32] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935064 (10Ladsgroup) reloading zones now. [15:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:45] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on mw2300.codfw.wmnet with reason: Reimaging issues [16:03:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on mw2300.codfw.wmnet with reason: Reimaging issues [16:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:43] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935118 (10greg) [16:11:14] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [16:11:20] (03PS1) 10Btullis: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259) [16:11:44] 
06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935144 (10Scott_French) Thanks so much, @SGupta-WMF. Alright, so I think we'... [16:12:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935154 (10Scott_French) [16:12:33] (03CR) 10AikoChou: [C:03+2] ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [16:12:49] (03CR) 10BCornwall: [V:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [16:13:27] (03Merged) 10jenkins-bot: ml-services: enable ALLOW_REVISION_JSON_INPUT for revertrisk in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050631 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [16:14:33] (03CR) 10Btullis: [C:03+2] cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:16:34] (03PS6) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [16:17:02] (03CR) 10Ssingh: [C:03+1] "Looks good, PCC looks clean!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [16:17:29] (03Merged) 10jenkins-bot: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050644 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:17:53] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935199 (10elukey) On db1195 I see for `emacs-nox`: ` MariaDB [debmonitor]> select * from bin_packages_package where name = 'emacs-nox'... [16:18:22] (03CR) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [16:19:59] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:23:07] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[16:25:44] (03PS1) 10Btullis: cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259) [16:26:30] FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:51] (03CR) 10Btullis: [C:03+2] cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:29:36] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935247 (10elukey) Ok I see, I ran debmonitor inside the dcl image: ` "os": "Debian 12", "uninstalled": [], "update_type": "f... [16:30:16] (03CR) 10RLazarus: [C:03+2] deployment_server: Add a mwscript-k8s cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [16:30:41] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9935249 (10fnegri) @bd808 @Ladsgroup thanks for your replies! I will reiterate that the general goal is to make root access to clouddb* hosts as saf... 
[16:31:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:48] (03Merged) 10jenkins-bot: cephcsi: correct image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050645 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:33:32] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:34:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:46] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935262 (10xcollazo) https://gerrit.wikimedia.org/r/1049617, which was pointed... [16:35:33] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:18] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:36:35] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:38:37] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9935304 (10Sharvaniharan) Hi @Dzahn @Ottomata If it helps, here is the full error text I am getting on my dashboard: https://docs.google.com/document/d/1A5VF4mbhCQIWPHHbylIGLJq6EGirn1ZYHXsXDPuY... [16:41:58] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935311 (10VRiley-WMF) [16:42:14] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368732#9935309 (10VRiley-WMF) →14Duplicate dup:03T368648 [16:43:12] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2026.codfw.wmnet with OS bullseye [16:49:58] (03PS7) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [16:51:24] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935338 (10VRiley-WMF) 05Open→03Resolved [16:51:27] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9935337 (10VRiley-WMF) This has been corrected by adjusting the power cable [17:03:03] (03PS1) 10Btullis: cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) [17:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:04:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:00] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935383 (10Dzahn) As someone involved in disabling the dumps services and pupp... [17:09:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:08] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) [17:34:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:44:15] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:45:33] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:39] (03PS2) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) [17:59:40] (03CR) 10Clare Ming: [C:03+2] Metrics 
Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci) [18:00:28] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050649 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci) [18:00:51] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) [18:01:04] (03PS1) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:04:11] (03PS2) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) [18:05:16] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:05:33] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:05:34] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:08:06] (03PS2) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:08:53] (03PS3) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:09:44] (03PS4) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 
10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:09:47] (03CR) 10Ladsgroup: [C:04-1] varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:10:15] (03CR) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:10:53] (03CR) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:13:19] (03PS2) 10Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) [18:13:25] (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci) [18:14:05] (03PS1) 10RLazarus: deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 [18:14:15] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050656 (https://phabricator.wikimedia.org/T368462) (owner: 10Santiago Faci) [18:16:19] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:16:36] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:16:44] (03PS5) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:17:46] (03CR) 10CI reject: [V:04-1] deployment_server: 
mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus) [18:18:09] (03CR) 10Ladsgroup: [C:03+1] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:18:48] (03PS2) 10RLazarus: deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 [18:18:56] (03PS1) 10Ladsgroup: Revert "wikimedia.org: Set CNAME record for donate.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1050662 [18:19:43] (03Abandoned) 10Ladsgroup: Revert "wikimedia.org: Set CNAME record for donate.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1050662 (owner: 10Ladsgroup) [18:19:48] (03PS3) 10Cathal Mooney: Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) [18:20:33] RESOLVED: [4x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:21:16] (03PS6) 10Ssingh: varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) [18:22:03] !log disable puppet on A:cp-text [18:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:12] (03PS2) 10Scott French: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) [18:22:23] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766 (10phaultfinder) 03NEW [18:24:22] (03CR) 10BCornwall: [C:03+1] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 
10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:25:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9935568 (10cmooney) I may have spoken too soon when I said things were working fine. It seems in codfw since the change we are only get... [18:27:17] (03CR) 10Ssingh: [C:03+2] varnish: redirect donate.wm.org Special:LandingPage to / [puppet] - 10https://gerrit.wikimedia.org/r/1050657 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [18:29:10] 06SRE, 10DNS, 06Traffic-Icebox, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#9935587 (10Dzahn) [18:29:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:30:56] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368767 (10phaultfinder) 03NEW [18:32:52] (03CR) 10Scott French: "Thanks so much for the review! Apologies for losing track of this patch. I've rebased to get back up to date, and it looks like this shoul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [18:34:48] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935618 (10Ladsgroup) I would like to monitor the databases when the dumps sta... 
[18:37:44] (03CR) 10Scott French: [C:03+1] deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus) [18:42:27] (03CR) 10RLazarus: [C:03+2] deployment_server: mwscript-cleanup fixes [puppet] - 10https://gerrit.wikimedia.org/r/1050661 (owner: 10RLazarus) [18:43:08] (03PS1) 10Ssingh: varnish: redirect donate.m.wikimedia.org temporarily after mobile_ [puppet] - 10https://gerrit.wikimedia.org/r/1050665 [18:43:32] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935636 (10xcollazo) Great, tentatively I've scheduled time with @BTullis on W... [18:44:21] (03CR) 10Ssingh: [C:03+2] varnish: redirect donate.m.wikimedia.org temporarily after mobile_ [puppet] - 10https://gerrit.wikimedia.org/r/1050665 (owner: 10Ssingh) [18:45:52] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9935645 (10xcollazo) >>! In T368098#9935618, @Ladsgroup wrote: > I would like... 
[18:54:23] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368099#9935664 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033 [18:54:39] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368564#9935682 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033 [18:56:13] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368767#9935733 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated psu [18:58:43] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9935754 (10Jclark-ctr) a:03VRiley-WMF Did mgmt ip address get updated for any maintenance you performed? [19:02:08] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9935782 (10Jclark-ctr) a:03BTullis [19:02:37] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9935780 (10VRiley-WMF) Not that I'm aware of. I used the same cable for everything. @Eevans would you happen to know if the IP address changed on this? 
[19:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:17] (03PS1) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 [19:14:22] (03CR) 10BCornwall: [C:03+1] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (owner: 10Ssingh) [19:15:22] (03PS2) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) [19:17:47] (03PS3) 10Ssingh: varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) [19:19:00] (03CR) 10BBlack: [C:03+1] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [19:19:30] (03CR) 10Ssingh: [C:03+2] varnish: completely rewrite donate.m.wikimedia.org to donate.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1050669 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [19:30:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:31:32] !log sudo cumin -b10 "A:cp-text" "run-puppet-agent --enable 'dont enable'": T368645 [19:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:37] !log jclark@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [19:31:42] T368645: Google search results pointing to nonexistent https://donate.m.wikimedia.org/ - https://phabricator.wikimedia.org/T368645 [19:31:48] !log jclark@cumin1002 START - Cookbook 
sre.dns.netbox [19:33:03] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1028 [19:34:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1028 [19:35:04] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1029 [19:36:06] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [19:36:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1029 [19:37:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [19:37:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:46:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy1028.eqiad.wmnet with OS bookworm [19:46:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy1029.eqiad.wmnet with OS bookworm [19:46:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935903 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbproxy1028.eqiad.wmnet with OS bookworm [19:46:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbproxy1029.eqiad.wmnet with OS bookworm [19:48:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935918 
(10Jclark-ctr) [19:51:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9935933 (10Jclark-ctr) a:03Jclark-ctr [19:54:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:57:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1029.eqiad.wmnet with reason: host reimage [19:57:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage [19:59:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:37] ^ lists1001 is not in production and we tried to disable the monitoring before but it's back.. 
[20:00:57] downtiming and them out [20:01:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1029.eqiad.wmnet with reason: host reimage [20:01:27] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on lists1001.wikimedia.org with reason: decomed [20:01:39] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on lists1001.wikimedia.org with reason: decomed [20:02:10] no alerts during this shift, cya [20:02:18] <3 [20:04:02] (03PS1) 10Jdlrobson: Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524) [20:04:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage [20:06:51] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:15:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:18:17] (03PS1) 10Ssingh: varnish: selectively redirect donate.m.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/1050672 (https://phabricator.wikimedia.org/T368645) [20:18:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:18:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1029.eqiad.wmnet with OS bookworm [20:18:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936083 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbproxy1029.eqiad.wmnet with OS bookworm completed: - dbproxy1029 (**PA... [20:19:18] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:20:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:20:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1028.eqiad.wmnet with OS bookworm [20:20:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936088 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbproxy1028.eqiad.wmnet with OS bookworm completed: - dbproxy1028 (**PA... [20:20:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936089 (10Jclark-ctr) [20:21:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936092 (10Jclark-ctr) 05Open→03Resolved [20:29:44] !log sudo cumin "A:cp-text" 'disable-puppet "CR 1050672"' [20:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:41] (03CR) 10Ssingh: [C:03+2] varnish: selectively redirect donate.m.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/1050672 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [20:31:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:23] (03PS18) 10Gergő 
Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [20:38:32] (03CR) 10Gergő Tisza: "PS 18: do not set $wgCentralAuthSsoUrlPrefix to false when on the shared domain to communicate that fact. It makes local testing more comp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:40:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:44:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:51:46] (03PS2) 10Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) [20:54:04] (03PS2) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) [20:54:43] (03CR) 10CI reject: [V:04-1] [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [21:00:07] (03PS3) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) [21:00:15] (03PS1) 10Ssingh: varnish: make trailing / optional for donate.m redirect [puppet] - 
10https://gerrit.wikimedia.org/r/1050676 [21:00:40] (03PS4) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) [21:01:41] (03CR) 10Ssingh: [C:03+2] varnish: make trailing / optional for donate.m redirect [puppet] - 10https://gerrit.wikimedia.org/r/1050676 (owner: 10Ssingh) [21:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:05:39] !log sudo cumin -b11 "A:cp-text" 'run-puppet-agent' [21:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9936203 (10Andrew) a:05Andrew→03None [21:11:27] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:14:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:15:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:16:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002" [21:16:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy 
FORCED [21:17:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002" [21:17:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:17:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED [21:18:05] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:18:25] (03PS2) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524) [21:18:25] (03PS3) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) [21:18:26] (03PS3) 10Jdlrobson: [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) [21:19:18] (03CR) 10CI reject: [V:04-1] [July 16th] Enable dark mode for logged out users (tier 1 and tier 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [21:20:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002" [21:21:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1039 - jclark@cumin1002" [21:21:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:22:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy FORCED [21:27:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936269 (10Jclark-ctr) cloudcephosd1039 2nd cable serial#20220008 port 1 cloudcephosd1040 2nd cable serial#20220043 port 5 cloudcephosd1041 2nd cable seria... [21:30:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy FORCED [21:33:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:18] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:34:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy FORCED [21:35:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:36:45] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:38:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:38:55] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:41:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1040 - jclark@cumin1002" [21:41:32] (03PS1) 10Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949) [21:41:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [21:41:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye [21:41:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install 
cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye [21:41:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936299 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye [21:42:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1040 - jclark@cumin1002" [21:42:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:42:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [21:42:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye [21:44:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:45:06] (03PS2) 10Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949) [21:49:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:58:27] 06SRE, 10DNS, 10fundraising-tech-ops, 06Traffic, 13Patch-For-Review: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9936334 (10Dwisehaupt) Adding @AKanji-WMF on this to coordinate with Major Gifts for the benefactors site. Anil: The previous tasks associated with this are: T10... [22:03:37] PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 27825 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [22:09:53] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1040'] [22:10:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1040'] [22:13:46] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1041'] [22:14:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1041'] [22:15:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:16:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936398 (10Jclark-ctr) [22:17:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936403 (10Jclark-ctr) a:03Jclark-ctr [22:24:15] FIRING: JobUnavailable: Reduced 
availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:32:07] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [22:36:57] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9936422 (10Ottomata) Yes, @Sharvaniharan will need analytics-privatedata-users access for that. Approved! [22:50:35] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet [22:52:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9936434 (10Papaul) [23:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:33] !log removing 1 image for legal compliance [23:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:09] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet [23:18:17] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet [23:21:51] !log removing 1 image for legal compliance [23:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:39] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet [23:23:50] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet [23:28:48] !log pt1979@cumin2002 START - 
Cookbook sre.dns.netbox [23:29:25] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet [23:31:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip address for cloudcephosd1039 - pt1979@cumin2002" [23:32:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip address for cloudcephosd1039 - pt1979@cumin2002" [23:32:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:32:45] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1039.eqiad.wmnet [23:33:58] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephosd1039.eqiad.wmnet [23:34:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050680 [23:38:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050680 (owner: 10TrainBranchBot) [23:42:37] !log removing 1 image for legal compliance [23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:50:36] !log jclark@cumin1002 START - Cookbook sre.dns.netbox 
[23:54:10] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [23:55:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [23:55:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:56:10] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:57:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:59:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable