[00:00:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025915 (owner: TrainBranchBot)
[00:04:29] <mutante>	 win 46
[00:06:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T361627)', diff saved to https://phabricator.wikimedia.org/P61788 and previous config saved to /var/cache/conftool/dbconfig/20240503-000602-marostegui.json
[00:06:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[00:06:06] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:06:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[00:06:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61789 and previous config saved to /var/cache/conftool/dbconfig/20240503-000614-marostegui.json
[00:07:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:32] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:11:22] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:18:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61790 and previous config saved to /var/cache/conftool/dbconfig/20240503-001805-marostegui.json
[00:18:12] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:33:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P61791 and previous config saved to /var/cache/conftool/dbconfig/20240503-003313-marostegui.json
[00:48:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P61792 and previous config saved to /var/cache/conftool/dbconfig/20240503-004821-marostegui.json
[01:03:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61793 and previous config saved to /var/cache/conftool/dbconfig/20240503-010330-marostegui.json
[01:03:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:03:33] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[01:03:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:04:05] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply updated JDK 8 - eevans@cumin1002
[01:10:42] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:17:16] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[01:20:08] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7001 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[02:20:32] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 49545600 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:21:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:38:54] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:54] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:27:01] <wikibugs>	 (CR) Jdlrobson: Revert "Update wgVectorClientPrefs to wgVectorAppearance" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1026630 (owner: Jdrewniak)
[03:28:48] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060 (ops-monitoring-bot) NEW
[03:48:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:54:27] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061 (phaultfinder) NEW
[03:55:29] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765807 (phaultfinder)
[03:59:24] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765808 (phaultfinder)
[04:00:26] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765809 (phaultfinder)
[04:44:15] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9765835 (Papaul) @Jclark-ctr @VRiley-WMF when the task was auto generated, it shows that disk sdg1 failed see in task description line below (F)  md1 : active raid10 sdh1[4]**// sdg1[2](F)//** sdf1[1] sde1[0] Toda...
[04:46:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[04:46:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[04:49:18] <wikibugs>	 (PS1) Marostegui: es1039: Not in setup anymore. [puppet] - https://gerrit.wikimedia.org/r/1026703
[04:50:48] <wikibugs>	 (CR) Marostegui: [C:+2] es1039: Not in setup anymore. [puppet] - https://gerrit.wikimedia.org/r/1026703 (owner: Marostegui)
[04:58:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:59:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[05:02:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:41] <wikibugs>	 (PS1) Marostegui: db1214: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026704
[05:09:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1214', diff saved to https://phabricator.wikimedia.org/P61794 and previous config saved to /var/cache/conftool/dbconfig/20240503-050947-root.json
[05:10:42] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:11:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1214.eqiad.wmnet with OS bookworm
[05:14:09] <wikibugs>	 (CR) Marostegui: [C:+2] db1214: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026704 (owner: Marostegui)
[05:24:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[05:24:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61795 and previous config saved to /var/cache/conftool/dbconfig/20240503-052430-marostegui.json
[05:24:33] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:27:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[05:35:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:35:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:04] <wikibugs>	 (PS1) Marostegui: Revert "db1214: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026640
[05:40:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:10] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:45:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61796 and previous config saved to /var/cache/conftool/dbconfig/20240503-054502-marostegui.json
[05:45:08] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:47:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1214.eqiad.wmnet with OS bookworm
[05:48:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61797 and previous config saved to /var/cache/conftool/dbconfig/20240503-054818-root.json
[05:48:21] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1214: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026640 (owner: Marostegui)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0600)
[06:00:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P61798 and previous config saved to /var/cache/conftool/dbconfig/20240503-060010-marostegui.json
[06:03:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61799 and previous config saved to /var/cache/conftool/dbconfig/20240503-060324-root.json
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P61800 and previous config saved to /var/cache/conftool/dbconfig/20240503-061517-marostegui.json
[06:18:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61801 and previous config saved to /var/cache/conftool/dbconfig/20240503-061830-root.json
[06:25:04] <wikibugs>	 (PS1) Slyngshede: Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128)
[06:25:56] <wikibugs>	 (CR) CI reject: [V:-1] Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[06:27:22] <wikibugs>	 (PS2) Slyngshede: Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128)
[06:30:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61802 and previous config saved to /var/cache/conftool/dbconfig/20240503-063025-marostegui.json
[06:30:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:30:29] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:30:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:30:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61803 and previous config saved to /var/cache/conftool/dbconfig/20240503-063048-marostegui.json
[06:33:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61804 and previous config saved to /var/cache/conftool/dbconfig/20240503-063336-root.json
[06:41:34] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067)
[06:41:38] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1025917 (https://phabricator.wikimedia.org/T364067)
[06:47:20] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti7002 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026715 (https://phabricator.wikimedia.org/T363978)
[06:47:34] <wikibugs>	 (PS1) Jdlrobson: Enable night mode on beta cluster desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889)
[06:48:23] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make ganeti7002 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026715 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[06:48:40] <wikibugs>	 SRE, Commons, MediaWiki-File-management, serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#9765986 (C.Suthorn) > Loading the original file or the 800px thumb would probably be non-ideal, partic...
[06:48:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61805 and previous config saved to /var/cache/conftool/dbconfig/20240503-064842-root.json
[06:49:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068 (Lina_Farid_WMDE) NEW
[06:53:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9766013 (Lina_Farid_WMDE)
[06:55:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61806 and previous config saved to /var/cache/conftool/dbconfig/20240503-065547-marostegui.json
[06:55:56] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:58:54] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0700)
[07:03:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61807 and previous config saved to /var/cache/conftool/dbconfig/20240503-070347-root.json
[07:10:54] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:10:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61808 and previous config saved to /var/cache/conftool/dbconfig/20240503-071057-marostegui.json
[07:11:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:13:01] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti7004 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026720 (https://phabricator.wikimedia.org/T363978)
[07:14:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make ganeti7004 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026720 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[07:18:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61809 and previous config saved to /var/cache/conftool/dbconfig/20240503-071853-root.json
[07:24:51] <wikibugs>	 (PS1) Marostegui: es1032: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1026722
[07:25:08] <wikibugs>	 (PS2) Jdlrobson: Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889)
[07:25:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add magru02 to netbox config [puppet] - https://gerrit.wikimedia.org/r/1026587 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[07:26:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61810 and previous config saved to /var/cache/conftool/dbconfig/20240503-072604-marostegui.json
[07:27:15] <wikibugs>	 (CR) Marostegui: [C:+2] es1032: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1026722 (owner: Marostegui)
[07:27:54] <wikibugs>	 (PS1) Muehlenhoff: Add install7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026729 (https://phabricator.wikimedia.org/T364016)
[07:32:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[07:32:37] <wikibugs>	 (CR) Kosta Harlan: [C:+1] "IIRC we need to pull this patch on the deployment server to avoid surprises during the next deployment windows on Monday. Once that's clar" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:32:39] <wikibugs>	 (CR) Ladsgroup: [C:+2] Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:32:49] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Arnadh2011' 'User435211' # T363654
[07:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:53] <stashbot>	 T363654: Stuck global rename [119980] - https://phabricator.wikimedia.org/T363654
[07:33:02] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[07:33:42] <wikibugs>	 (Merged) jenkins-bot: Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:34:06] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add install7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026729 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:35:34] <wikibugs>	 (PS1) Muehlenhoff: Add bast7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026786 (https://phabricator.wikimedia.org/T364016)
[07:37:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add bast7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026786 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:39:09] <wikibugs>	 (PS1) Muehlenhoff: preseed: Extend globbing for bast and prometheus to cover magru [puppet] - https://gerrit.wikimedia.org/r/1026787 (https://phabricator.wikimedia.org/T364016)
[07:41:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61811 and previous config saved to /var/cache/conftool/dbconfig/20240503-074112-marostegui.json
[07:41:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:41:15] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:41:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:41:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61812 and previous config saved to /var/cache/conftool/dbconfig/20240503-074135-marostegui.json
[07:43:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] preseed: Extend globbing for bast and prometheus to cover magru [puppet] - https://gerrit.wikimedia.org/r/1026787 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:48:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:49:03] <wikibugs>	 (PS1) Zabe: Initial configuration for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529)
[07:52:11] <wikibugs>	 SRE, SRE-Access-Requests, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9766190 (awight)
[07:53:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast7001.wikimedia.org
[07:53:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:57:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7001.wikimedia.org - jmm@cumin2002"
[07:59:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7001.wikimedia.org - jmm@cumin2002"
[07:59:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast7001.wikimedia.org on all recursors
[07:59:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7001.wikimedia.org on all recursors
[07:59:47] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9766206 (phaultfinder)
[08:00:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7001.wikimedia.org - jmm@cumin2002"
[08:00:26] <wikibugs>	 SRE, SRE-Access-Requests, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users  for  linafaridwmde - https://phabricator.wikimedia.org/T364068#9766207 (Lena_WMDE) As the manager of @Lina_Farid_WMDE I approve the request.
[08:00:42] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9766208 (phaultfinder)
[08:00:44] <wikibugs>	 (PS1) Slyngshede: P:trafficserver::backend add cloudtestidm [puppet] - https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128)
[08:00:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7001.wikimedia.org - jmm@cumin2002"
[08:05:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast7001.wikimedia.org with OS bookworm
[08:06:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61813 and previous config saved to /var/cache/conftool/dbconfig/20240503-080649-marostegui.json
[08:06:52] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[08:11:54] <moritzm>	 !log installing emacs security updates
[08:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:11] <wikibugs>	 (CR) Slyngshede: "There's quite a bit of code here, but some of it allows us to remove existing code in a followup patch." [software/bitu] - https://gerrit.wikimedia.org/r/1026458 (owner: Slyngshede)
[08:17:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:20:07] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[08:21:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P61814 and previous config saved to /var/cache/conftool/dbconfig/20240503-082156-marostegui.json
[08:24:06] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:26:22] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[08:28:36] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[08:28:43] <wikibugs>	 (CR) Slyngshede: [C:+2] Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:30:13] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[08:32:05] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) (owner: Dzahn)
[08:33:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast7001.wikimedia.org with reason: host reimage
[08:34:45] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete certs for ldap-corp [puppet] - https://gerrit.wikimedia.org/r/1026797 (https://phabricator.wikimedia.org/T323820)
[08:36:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7001.wikimedia.org with reason: host reimage
[08:37:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P61815 and previous config saved to /var/cache/conftool/dbconfig/20240503-083703-marostegui.json
[08:39:15] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-File-management, 06serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#9766334 (10Bawolff) I think if we did deliver the wrong thumbsize, it only makes sense to deliver one la...
[08:41:50] <wikibugs>	 (03PS1) 10Jelto: phabricator: increase phabricator page delay to 4m [puppet] - 10https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401)
[08:48:42] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: 10Jelto)
[08:48:45] <XioNoX>	 !log restart turnilo
[08:48:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61816 and previous config saved to /var/cache/conftool/dbconfig/20240503-085211-marostegui.json
[08:52:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:52:14] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[08:52:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7001.wikimedia.org with OS bookworm
[08:52:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7001.wikimedia.org
[08:52:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:52:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61817 and previous config saved to /var/cache/conftool/dbconfig/20240503-085234-marostegui.json
[08:56:20] <wikibugs>	 (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy)
[08:56:34] <wikibugs>	 (03PS1) 10Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[08:56:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff)
[08:58:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750)
[09:02:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:59] <wikibugs>	 (03PS2) 10Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[09:03:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff)
[09:03:49] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney)
[09:04:31] <wikibugs>	 (03PS3) 10Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[09:07:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1026806
[09:09:12] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff)
[09:10:42] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:11:48] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:17:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61818 and previous config saved to /var/cache/conftool/dbconfig/20240503-091750-marostegui.json
[09:17:53] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[09:20:46] <wikibugs>	 (03PS1) 10Ayounsi: magru: alert on Transit BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/1026808 (https://phabricator.wikimedia.org/T362421)
[09:22:52] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] magru: alert on Transit BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/1026808 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi)
[09:26:03] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:26:09] <icinga-wm>	 RECOVERY - Host ps1-b8-codfw is UP: PING WARNING - Packet loss = 66%, RTA = 0.21 ms
[09:26:59] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-X on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-Y on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-Y on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-X on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-Z on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:27:00] <icinga-wm>	 PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-Z on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:31:44] <icinga-wm>	 PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Idle - Telxius, AS12956/IPv4: Idle - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:32:32] <icinga-wm>	 PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[09:32:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61820 and previous config saved to /var/cache/conftool/dbconfig/20240503-093257-marostegui.json
[09:40:22] <wikibugs>	 (03CR) 10TheDJ: "Scheduled this for the 9th of May puppet request window." [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff)
[09:48:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61821 and previous config saved to /var/cache/conftool/dbconfig/20240503-094805-marostegui.json
[09:49:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092 (10cmooney) 03NEW p:05Triage→03Medium
[09:50:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766638 (10cmooney)
[09:55:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766653 (10ayounsi) Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru.
[09:57:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:03:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61822 and previous config saved to /var/cache/conftool/dbconfig/20240503-100313-marostegui.json
[10:03:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance
[10:03:16] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[10:03:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance
[10:03:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61823 and previous config saved to /var/cache/conftool/dbconfig/20240503-100335-marostegui.json
[10:12:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:13:02] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[10:14:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:14:25] <wikibugs>	 (03PS1) 10Btullis: Add a ceph client for the dse-k8s container storage interface [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259)
[10:15:41] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2240/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[10:15:59] <moritzm>	 !log installing Java 17 security updates on idp-test
[10:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:30] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 (10cmooney) 03NEW p:05Triage→03Medium
[10:16:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9766721 (10cmooney)
[10:16:53] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766720 (10cmooney)
[10:19:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:23:16] <icinga-wm>	 RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms
[10:23:42] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:42] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:42] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:42] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:42] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:43] <icinga-wm>	 PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:24:19] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:25:35] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 (10cmooney) 03NEW p:05Triage→03Medium
[10:25:48] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766766 (10cmooney)
[10:25:49] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766767 (10cmooney)
[10:27:36] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on lsw1-a1-codfw,lsw1-a1-codfw IPv6,lsw1-a1-codfw.mgmt with reason: device being decommed and renamed, downtiming as a precaution first
[10:27:51] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lsw1-a1-codfw,lsw1-a1-codfw IPv6,lsw1-a1-codfw.mgmt with reason: device being decommed and renamed, downtiming as a precaution first
[10:28:05] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766769 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b27eb80b-98ee-43fb-8026-b02b3e00b5d4) set by cmooney@cumin1002 for 14 days, 0:00:00 on 3 host(s) and their...
[10:28:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61825 and previous config saved to /var/cache/conftool/dbconfig/20240503-102809-marostegui.json
[10:28:13] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[10:29:31] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097)
[10:29:40] <icinga-wm>	 PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[10:30:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822
[10:30:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff)
[10:31:24] <wikibugs>	 (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah)
[10:32:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016"
[10:32:06] <stashbot>	 T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016
[10:33:18] <wikibugs>	 (03PS3) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821
[10:33:18] <wikibugs>	 (03PS4) 10Majavah: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822
[10:33:18] <wikibugs>	 (03PS4) 10Majavah: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823
[10:33:35] <wikibugs>	 (03CR) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah)
[10:33:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016"
[10:34:49] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from production [homer/public] - 10https://gerrit.wikimedia.org/r/1026823 (https://phabricator.wikimedia.org/T364097)
[10:35:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016)
[10:35:36] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766810 (10cmooney) Device has been removed from LibreNMS now. I also downtimed it for 2 weeks just in case I mess up the order of anything.
[10:35:56] <wikibugs>	 (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah)
[10:38:08] <wikibugs>	 (03PS1) 10Marostegui: db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026846
[10:38:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1203', diff saved to https://phabricator.wikimedia.org/P61826 and previous config saved to /var/cache/conftool/dbconfig/20240503-103814-root.json
[10:38:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026846 (owner: 10Marostegui)
[10:39:16] <wikibugs>	 (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: use runtime_description (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah)
[10:39:26] <wikibugs>	 (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 (owner: 10Majavah)
[10:39:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1203.eqiad.wmnet with OS bookworm
[10:40:09] <wikibugs>	 (03Merged) 10jenkins-bot: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah)
[10:40:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff)
[10:41:18] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Remove lsw1-a1-codfw from production [homer/public] - 10https://gerrit.wikimedia.org/r/1026823 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[10:41:34] <wikibugs>	 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9766840 (10MatthewVernon) I think I have two questions:    # Where is it defined what should and shouldn't get its own intermediate? (e.g. I see cassandra has one)   # Is ther...
[10:41:40] <icinga-wm>	 PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on doh7001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd} https://wikitech.wikimedia.org/wiki/Microcode
[10:42:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff)
[10:42:41] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847
[10:43:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61827 and previous config saved to /var/cache/conftool/dbconfig/20240503-104317-marostegui.json
[10:43:37] <wikibugs>	 (03Merged) 10jenkins-bot: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah)
[10:43:42] <wikibugs>	 (03Merged) 10jenkins-bot: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 (owner: 10Majavah)
[10:44:52] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766846 (10cmooney)
[10:46:06] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add a ceph client for the dse-k8s container storage interface [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[10:46:37] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1203: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026835
[10:47:06] <wikibugs>	 (03PS2) 10Muehlenhoff: Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822
[10:47:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff)
[10:50:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host durum7002.magru.wmnet
[10:50:13] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[10:50:22] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766856 (10cmooney)
[10:50:58] <wikibugs>	 (03PS2) 10Cathal Mooney: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097)
[10:51:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow7001.magru.wmnet
[10:52:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7002.magru.wmnet - sukhe@cumin1002"
[10:52:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage
[10:53:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7002.magru.wmnet - sukhe@cumin1002"
[10:53:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:53:05] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache durum7002.magru.wmnet on all recursors
[10:53:08] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7002.magru.wmnet on all recursors
[10:53:29] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7002.magru.wmnet - sukhe@cumin1002"
[10:53:59] <wikibugs>	 (03PS3) 10Muehlenhoff: Druid: historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822
[10:54:22] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7002.magru.wmnet - sukhe@cumin1002"
[10:55:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage
[10:56:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow7001.magru.wmnet
[10:57:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff)
[10:58:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240503-105824-marostegui.json
[10:58:47] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum7002.magru.wmnet with OS bookworm
[10:58:54] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:00:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir7001.magru.wmnet
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0700)
[11:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T1100).
[11:02:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir7001.magru.wmnet
[11:05:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh7001.wikimedia.org
[11:06:56] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host doh7002.wikimedia.org
[11:06:57] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[11:07:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:54] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7002.wikimedia.org - sukhe@cumin1002"
[11:09:09] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! Checked existing usage and compared it against Netbox info." [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[11:09:15] <icinga-wm>	 RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on doh7001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[11:09:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh7001.wikimedia.org
[11:09:42] <wikibugs>	 (03PS1) 10Majavah: wikireplicas: Sanitize logging_logindex target values [puppet] - 10https://gerrit.wikimedia.org/r/1026856 (https://phabricator.wikimedia.org/T363633)
[11:09:47] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7002.wikimedia.org - sukhe@cumin1002"
[11:09:47] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:09:47] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache doh7002.wikimedia.org on all recursors
[11:09:50] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7002.wikimedia.org on all recursors
[11:09:55] <wikibugs>	 06SRE-OnFire, 10Beta-Cluster-Infrastructure, 10logspam-watch, 10Sustainability (Incident Followup): (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379#9766924 (10brennen)
[11:10:18] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7002.wikimedia.org - sukhe@cumin1002"
[11:10:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1203: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026835 (owner: 10Marostegui)
[11:11:21] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7002.wikimedia.org - sukhe@cumin1002"
[11:11:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61828 and previous config saved to /var/cache/conftool/dbconfig/20240503-111129-root.json
[11:11:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:11:41] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host doh7002.wikimedia.org with OS bookworm
[11:11:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1414 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:11:59] <wikibugs>	 (03PS1) 10Majavah: wikireplicas: update-views: Add filter option [cookbooks] - 10https://gerrit.wikimedia.org/r/1026857
[11:12:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum7001.magru.wmnet
[11:13:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Adjust IBGP route-reflector spine/leaf automation to support separate client clusters - https://phabricator.wikimedia.org/T364103 (10cmooney) 03NEW p:05Triage→03Medium
[11:13:20] <wikibugs>	 (03CR) 10Majavah: [C:03+2] wikireplicas: Sanitize logging_logindex target values [puppet] - 10https://gerrit.wikimedia.org/r/1026856 (https://phabricator.wikimedia.org/T363633) (owner: 10Majavah)
[11:13:37] <wikibugs>	 (03PS2) 10JMeybohm: New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310)
[11:13:37] <wikibugs>	 (03PS1) 10JMeybohm: New version of base.certificates module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310)
[11:13:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61829 and previous config saved to /var/cache/conftool/dbconfig/20240503-111337-marostegui.json
[11:13:40] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[11:13:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance
[11:13:40] <wikibugs>	 (03PS1) 10JMeybohm: Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310)
[11:13:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance
[11:13:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:14:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:14:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61830 and previous config saved to /var/cache/conftool/dbconfig/20240503-111415-marostegui.json
[11:15:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[11:15:59] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:15:59] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=93)
[11:16:11] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:16:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1203.eqiad.wmnet with OS bookworm
[11:17:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum7001.magru.wmnet
[11:17:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:17:46] <wikibugs>	 (03PS3) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310)
[11:17:46] <wikibugs>	 (03CR) 10JMeybohm: "What do you have in mind here? I made the chart not very configurable on purpose currently. Any particular cases that you thing need extra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[11:18:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw1452:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1452 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:19:46] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:23:51] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7002.magru.wmnet with reason: host reimage
[11:24:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:26:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61831 and previous config saved to /var/cache/conftool/dbconfig/20240503-112635-root.json
[11:27:00] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[11:27:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7002.magru.wmnet with reason: host reimage
[11:27:24] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[11:32:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:44] <wikibugs>	 (03PS1) 10Btullis: Make caps an optional parameter to the Ceph::Auth::ClientAuth type [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105)
[11:34:07] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2242/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: 10Btullis)
[11:36:05] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] wmp-laptop-sre: Add support for magru [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1025742 (owner: 10Muehlenhoff)
[11:38:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1026871
[11:38:40] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7002.wikimedia.org with reason: host reimage
[11:39:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61832 and previous config saved to /var/cache/conftool/dbconfig/20240503-113924-marostegui.json
[11:39:28] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[11:41:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61833 and previous config saved to /var/cache/conftool/dbconfig/20240503-114141-root.json
[11:41:53] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1414 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:41:55] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7002.wikimedia.org with reason: host reimage
[11:42:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:41] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Add node20 production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681)
[11:44:14] <topranks>	 !log Removing connections from ssw1-a1-codfw and ssw1-a8-codfw to lsw1-a1-codfw T364097
[11:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:17] <stashbot>	 T364097: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097
[11:44:27] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Set up Ganeti clusters in magru - https://phabricator.wikimedia.org/T363978#9767109 (10MoritzMuehlenhoff) 05Open→03Resolved The two clusters (magru01 and magru02) are setup and initial VMs have been created already.
[11:45:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[11:45:09] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7002.magru.wmnet with OS bookworm
[11:45:10] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7002.magru.wmnet
[11:45:40] <wikibugs>	 (03Merged) 10jenkins-bot: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[11:47:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:51:44] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767123 (10cmooney)
[11:53:00] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1026871 (owner: 10Muehlenhoff)
[11:53:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[11:54:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61834 and previous config saved to /var/cache/conftool/dbconfig/20240503-115431-marostegui.json
[11:54:57] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:55:44] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove lsw1-a1-codfw phyiscal link dns - cmooney@cumin1002"
[11:56:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61835 and previous config saved to /var/cache/conftool/dbconfig/20240503-115647-root.json
[11:57:05] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove lsw1-a1-codfw phyiscal link dns - cmooney@cumin1002"
[11:57:05] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:01:12] <moritzm>	 !log uploaded wmf-sre-laptop 0.5.10 to apt.wikimedia.org
[12:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:30] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7002.wikimedia.org with OS bookworm
[12:02:30] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7002.wikimedia.org
[12:04:24] <wikibugs>	 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767187 (10phaultfinder)
[12:05:32] <wikibugs>	 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767189 (10phaultfinder)
[12:06:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[12:06:33] <topranks>	 !log removing entries for lsw1-a1-codfw switch and private1-a1-codfw vlan from puppet T364097
[12:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:36] <stashbot>	 T364097: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097
[12:07:50] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767210 (10cmooney)
[12:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61837 and previous config saved to /var/cache/conftool/dbconfig/20240503-120938-marostegui.json
[12:11:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61838 and previous config saved to /var/cache/conftool/dbconfig/20240503-121153-root.json
[12:22:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:22:45] <wikibugs>	 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767281 (10elukey) Hi! Trying to answer inline, Chris can chime in if I miss anything and/or if I write something totally off :)  >>! In T356412#9766840, @MatthewVernon wrote:...
[12:24:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:24:31] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:24:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61839 and previous config saved to /var/cache/conftool/dbconfig/20240503-122446-marostegui.json
[12:24:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance
[12:24:50] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[12:25:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance
[12:25:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61840 and previous config saved to /var/cache/conftool/dbconfig/20240503-122510-marostegui.json
[12:26:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61841 and previous config saved to /var/cache/conftool/dbconfig/20240503-122659-root.json
[12:27:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:29:58] <jinxer-wm>	 FIRING: [17x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:30:38] <jinxer-wm>	 FIRING: MXQueueNoMetrics: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[12:30:38] <jinxer-wm>	 FIRING: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[12:30:38] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:31:17] <Amir1>	 here
[12:31:36] <Amir1>	 cwhite: ?
[12:32:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:44] <Amir1>	 fabfur: herron ^
[12:33:54] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service t-b-pki-01:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#t-b-pki-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:33:54] <jinxer-wm>	 FIRING: [9x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:34:37] <Amir1>	 so many alerts, I don't know where it's starting
[12:35:38] <jinxer-wm>	 FIRING: [45x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:35:47] <jelto>	 probably a ganeti issue? looks quite random
[12:35:50] <jelto>	 hm 
[12:36:27] <Amir1>	 could be network as well
[12:36:37] <taavi>	 or new prometheus box in magru?
[12:37:00] <Amir1>	 or alert infra
[12:37:26] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install7001.wikimedia.org
[12:39:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:39:34] <Amir1>	 things are working when I test them, the graphs don't show a sign they are down but a lot of probes are failing
[12:42:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61842 and previous config saved to /var/cache/conftool/dbconfig/20240503-124204-root.json
[12:42:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:42:41] <jelto>	 our team dashboard for alertmanager also has quite a lot of random alerts and linting problems 
[12:43:37] <Amir1>	 this is unrelated https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&var-site=All&var-deployment=mw-parsoid&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&viewPanel=63
[12:44:49] <wikibugs>	 (03CR) 10Jforrester: "Should there be a -devel image too (with npm in it)?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff)
[12:46:50] <wikibugs>	 (03PS4) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310)
[12:47:12] <Amir1>	 herron: the probes are failing and triggering a cascade of pages
[12:47:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:47:26] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1416 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:47:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[12:48:25] <wikibugs>	 (03PS1) 10Muehlenhoff: squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910
[12:50:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61843 and previous config saved to /var/cache/conftool/dbconfig/20240503-125015-marostegui.json
[12:50:24] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[12:50:41] <vgutierrez>	 hmmm network on eqiad isn't happy
[12:50:43] <vgutierrez>	 https://grafana.wikimedia.org/goto/tUvFRmLSg?orgId=1
[12:51:05] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2430 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:51:14] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:51:39] <Amir1>	 that would explain the ferm issues?
[12:51:44] <vgutierrez>	 it looks like an IPv6 issue (eqiad->eqiad ICMP latency is OK for IPv4 but all over the place for IPv6)
[12:52:26] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:38] <vgutierrez>	 topranks: is anything going on with cr1-eqiad? 
[12:52:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1397 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:52:55] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097)
[12:53:13] <topranks>	 vgutierrez: I hope not 
[12:53:29] <Amir1>	 topranks: we have like 100 probe failure pages basically
[12:53:35] <Amir1>	 https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:01] <Amir1>	 but things are working
[12:54:06] <Amir1>	 soooo
[12:54:07] * topranks looking 
[12:54:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[12:54:31] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:54:53] <topranks>	 vgutierrez: any specifics on that IPv6 latency?
[12:55:00] <topranks>	 you looking at a graph or doing a particular test?
[12:55:05] <wikibugs>	 (03CR) 10Muehlenhoff: "Not sure, we didn't do this for the previous images based on nodesource debs neither (node14/node16), and I don't think nodesource ships n" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff)
[12:55:10] <vgutierrez>	 topranks: grafana link I shared above
[12:55:16] <topranks>	 sry
[12:55:17] <vgutierrez>	 topranks: https://grafana.wikimedia.org/goto/I8WfgiYSR?orgId=1
[12:55:24] <jelto>	 I have to step out
[12:55:28] <wikibugs>	 (03PS2) 10Muehlenhoff: squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910
[12:57:26] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on mw1452:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1452 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:58:57] <vgutierrez>	 it looks like 8503219001 by topranks triggered the ferm rules update fleet wide
[12:59:18] <topranks>	 ok... 
[12:59:32] <topranks>	 that patch was to remove an unused subnet from puppet defs 
[12:59:33] <vgutierrez>	 aka dropping private-a1-codfw
[12:59:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9767450 (10CDanis) magru is a clear win for: UY, CL, AR, BR, PY  It's better for some but not all users in: BO, PE  {F49974214}
[13:00:52] <vgutierrez>	 topranks: yep.. those ranges are exposed on the hosts with ferm on /etc/ferm/conf.d/00_defs
[13:00:59] <topranks>	 I can't see anything wrong with the network, or reproduce anything at a network level 
[13:01:26] <topranks>	 yeah, I guess not good if a change there results in this 
[13:01:38] <topranks>	 but I suppose the answer is nftables?
[13:02:00] <topranks>	 I would have thought it'd be a rolling change as hosts run the puppet agent at different times 
[13:02:26] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:48] <elukey>	 the probe down alerts are listing "14 hours ago" that is a little confusing
[13:02:51] <vgutierrez>	 this is the ferm error I'm seeing https://www.irccloud.com/pastebin/ccdNCSXy/
[13:03:04] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:03:54] <topranks>	 vgutierrez: hmm ok, so the restart failed 
[13:04:27] <vgutierrez>	 yup, dunno if that would trigger a weird state where the default policy is still DROP and no rules are present
[13:04:28] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:04:48] <vgutierrez>	 hopefully not :)
[13:05:10] <topranks>	 yeah that would definitely not be a good way for it to operate 
[13:05:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61844 and previous config saved to /var/cache/conftool/dbconfig/20240503-130523-marostegui.json
[13:05:27] <topranks>	 looking at that particular host, mw1397, the old private1-a1-codfw range is not in the ruleset anymore 
[13:05:38] <topranks>	 so it seems to have managed to reload after that?
[13:05:47] <vgutierrez>	 yep, on the next puppet run
[13:05:50] <vgutierrez>	 30 minutes later
[13:06:22] <sukhe>	 we have seen some of these transient ferm reload issues before as well. usually on a few mw* hosts but nothing beyond them
[13:06:32] <topranks>	 yeah exactly, 30 mins later 
[13:06:35] <topranks>	 hmm 
[13:06:51] <elukey>	 folks one thing that I am not getting - why do we have probe downs still present in alerts.w.o? 
[13:07:09] * vgutierrez looking
[13:07:30] <elukey>	 they have not recovered, things seem to work but it is rather worrying
[13:07:39] <elukey>	 also they have 14 hours ago marks
[13:07:58] <Amir1>	 elukey: I got the page for them half an hour ago
[13:08:04] <sukhe>	 !incidents
[13:08:05] <sirenbot>	 4650 (ACKED)  [10x] ProbeDown sre (probes/service eqiad)
[13:08:05] <sirenbot>	 4649 (RESOLVED)  [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[13:08:05] <sirenbot>	 4648 (RESOLVED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[13:08:05] <sirenbot>	 4647 (RESOLVED)  db1189 (paged)/MariaDB Replica SQL: s3 (paged)
[13:08:16] <topranks>	 are there still hosts with ferm in state failed?
[13:08:36] <Amir1>	 they are everywhere
[13:08:41] <topranks>	 i.e. we saw on mw1397 it took till the next puppet run to restart the service (and presumably restore the correct ruleset)
[13:08:50] <sukhe>	 https://puppetboard.wikimedia.org/failures
[13:08:54] <sukhe>	 just mw1416
[13:09:14] <sukhe>	 and even looks good there now
[13:09:44] <moritzm>	 the k8s ones are auto-correcting via a systemd timer
[13:10:22] <moritzm>	 https://phabricator.wikimedia.org/T354855
[13:10:42] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[13:10:58] <Amir1>	 that's not good
[13:11:30] <sukhe>	 that's expected because of magr provisioning but we get alerted for other sites as well which we should not
[13:11:35] <sukhe>	 swfrench-wmf filed a task for that
[13:11:40] <sukhe>	 https://phabricator.wikimedia.org/T363924
[13:11:49] <sukhe>	 s/magr/magru
[13:12:05] <topranks>	 same story on mw1416
[13:12:12] <sukhe>	 once we reimage ncredir7002 today, I will do a cleanup of the confd state
[13:12:20] <topranks>	 ferm failed at 12:38, next puppet run at 13:08 restarted it clean 
[13:12:26] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:36] <wikibugs>	 (03PS1) 10Slyngshede: LDAP Eventlog [software/bitu] - 10https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478)
[13:13:29] <vgutierrez>	 elukey: so.. using the alert query link option on alerts.w.o shows a query with an empty result... https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=%28avg_over_time%28probe_success%7Bjob%3D~%22probes%2F.%2A%22%2Cmodule%3D~%22%28http%7Ctcp%29.%2A%22%7D%5B1m%5D%29+and+on+%28instance%29+service_catalog_page+%3D%3D+1%29+%2A+100+%3C+10&g0.tab=1
[13:13:39] <wikibugs>	 (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[13:13:42] <vgutierrez>	 elukey: so those could be stale alerts on klaxon?
[13:14:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:15:38] <elukey>	 vgutierrez: o/ sorry I am missing something, what is the relationship with klaxon and the alerts in https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DProbeDown ?
[13:16:11] <wikibugs>	 (03PS2) 10Cathal Mooney: Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097)
[13:16:20] <vgutierrez>	 elukey: brain fart.. what's the name of the UI running on alerts.w.o?
[13:16:21] <wikibugs>	 (03CR) 10Ssingh: "; private1-a1-codfw (2620:0:860:105::/64)" [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[13:16:28] <sukhe>	 karma
[13:16:29] <Amir1>	 and why is it paging now instead of 14 hours ago
[13:16:31] <vgutierrez>	 thanks
[13:16:38] <vgutierrez>	 s/klaxon/karma/ :)
[13:16:54] <elukey>	 ahhh okok sorry makes sense!
[13:17:05] <elukey>	 never checked karma, lemme see
[13:17:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:42] <vgutierrez>	 !incidents
[13:17:43] <sirenbot>	 4650 (ACKED)  [10x] ProbeDown sre (probes/service eqiad)
[13:17:43] <sirenbot>	 4649 (RESOLVED)  [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[13:17:43] <sirenbot>	 4648 (RESOLVED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[13:17:43] <sirenbot>	 4647 (RESOLVED)  db1189 (paged)/MariaDB Replica SQL: s3 (paged)
[13:17:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1416 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:20:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[13:20:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61845 and previous config saved to /var/cache/conftool/dbconfig/20240503-132030-marostegui.json
[13:20:35] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "I think it's time to admit that you enjoy messing with v6 PTRs 😊" [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[13:21:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2430 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:21:11] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[13:21:41] <elukey>	 so far nothing weird in the karma logs
[13:21:50] <elukey>	 (TIL that the alerts' UI is called karma)
[13:22:07] <vgutierrez>	 checking on logstash the logs for the citoid probe I can see a 503 around the time where the dashboard flags the last citoid issue
[13:22:21] <vgutierrez>	 May 3, 2024 @ 13:13:44.170	prometheus1006	target=https://[10.2.2.19]:4003/_info msg="Received HTTP response" status_code=503
[13:22:26] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:22:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1397 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:23:53] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:03] <elukey>	 I see puppet running at around 12:17 UTC on prometheus1006
[13:24:47] <wikibugs>	 (03PS4) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250)
[13:24:58] <Amir1>	 are the logs all on prometheus1006?
[13:24:59] <elukey>	 anything against me restarting karma?
[13:25:09] <vgutierrez>	 Amir1: yep
[13:25:12] <Amir1>	 stupid q: Where can you see the logs?
[13:25:16] <vgutierrez>	 Amir1: https://logstash.wikimedia.org/goto/978e97b1d6f9c475d4d6bc8a6065752f
[13:25:24] <Amir1>	 ah thanks
[13:25:34] <vgutierrez>	 the logstash dashboard is linked on the grafana dashboard BTW
[13:25:47] <vgutierrez>	 oh.. not just 1006, 1005 as well
[13:26:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002"
[13:26:59] <elukey>	 !log restart karma on alert1001 to verify if probe down alerts shown are stale
[13:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:31] <elukey>	 nope, same thing
[13:27:42] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[13:27:56] <vgutierrez>	 for netbox it's the same issue, some 503s
[13:28:28] <wikibugs>	 (03CR) 10Bking: [C:03+1] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff)
[13:28:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002"
[13:28:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install7001.wikimedia.org on all recursors
[13:28:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install7001.wikimedia.org on all recursors
[13:29:07] <vgutierrez>	 so no networking connectivity issues between the probes and the service itself but L7 errors
[13:29:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002"
[13:30:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002"
[13:30:57] <vgutierrez>	 just pinged o11y folks on -observability.. maybe they can help :)
[13:31:22] <topranks>	 vgutierrez: looking at, for instance that citoid one, the IP is announced by LVS 
[13:31:44] <vgutierrez>	 topranks: yeah... it's targeting citoid.svc.eqiad.wmnet
[13:31:54] <topranks>	 is the issue maybe LVS using v6 to the back-end and failing due to the ferm issue?
[13:32:55] <vgutierrez>	 topranks: LVS doesn't perform v4->v6
[13:33:04] <vgutierrez>	 v4 VIPs have v4 real servers
[13:33:15] <topranks>	 yeah brain fart it just writes the L2 header 
[13:33:18] <vgutierrez>	 indeed
[13:33:19] <topranks>	 yep yep 
[13:33:30] <vgutierrez>	 !on-call
[13:33:41] <herron>	 I'm seeing occasional 503 with curl -v https://citoid.discovery.wmnet:4003/_info from the prom host, maybe 1/5 tries?
[13:34:31] <vgutierrez>	 herron: yeah.. but that shouldn't trigger a p.a.g.e, right?
[13:34:48] <vgutierrez>	 herron: why is karma showing citoid as paging since 14h ago?
[13:34:57] <elukey>	 yes it is very confusing
[13:35:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61846 and previous config saved to /var/cache/conftool/dbconfig/20240503-133538-marostegui.json
[13:35:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance
[13:35:42] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[13:35:53] <Amir1>	 I have to eat something
[13:35:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance
[13:36:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61847 and previous config saved to /var/cache/conftool/dbconfig/20240503-133601-marostegui.json
[13:36:03] <Amir1>	 will be back
[13:36:12] <vgutierrez>	 Amir1: talk to fabfur, he can patch your firmware
[13:36:22] <wikibugs>	 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9767646 (10elukey) Status: Lift Wing codfw has been migrated successfully, we are going to do eqiad on Monday 6th.
[13:40:01] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767654 (10cmooney)
[13:41:06] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767659 (10cmooney) a:03Papaul @papaul I think this one is ready to be moved to rack D1 now.
[13:41:31] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767661 (10cmooney)
[13:43:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bookworm
[13:43:29] <wikibugs>	 (03PS1) 10Elukey: Move mw-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[13:45:12] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2243/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:46:09] <wikibugs>	 (03CR) 10Muehlenhoff: "Filename should be ms-fe1009, not mw-fe1009" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:46:20] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767687 (10elukey) I have also reviewed the non-cpXXXX IPs found in netstat on ms-fe nodes, they seem all belonging to the thumbor pods, that should be u...
[13:47:16] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097)
[13:48:26] <wikibugs>	 (03CR) 10Muehlenhoff: Move mw-fe1009's envoy TLS cert to PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:49:24] <wikibugs>	 (03PS2) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[13:49:46] <wikibugs>	 (03CR) 10Elukey: "Yes yes PEBCAK, I was puzzled that PCC showed no changes :D" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:51:16] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2244/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:51:52] <wikibugs>	 (03PS3) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[13:53:16] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2245/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:54:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:54:38] <wikibugs>	 (03CR) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[13:57:06] <sukhe>	 topranks: re https://gerrit.wikimedia.org/r/1026928
[13:57:14] <sukhe>	 I am going to be running homer shortly in case you want it to be merged
[13:58:14] <topranks>	 sukhe: thanks, that one's not urgent
[13:58:27] <topranks>	 it won't cause any network changes once merged anyway, just tidy-up 
[13:58:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61848 and previous config saved to /var/cache/conftool/dbconfig/20240503-135834-marostegui.json
[13:58:39] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[13:58:44] <sukhe>	 ok. so I should feel free to merge this? asking because I will run homer for the durum/doh hosts!
[13:59:43] <topranks>	 sure feel free to +1 for me, it's safe anyway 
[14:00:39] <sukhe>	 thanks
[14:00:52] <wikibugs>	 (03PS1) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412)
[14:00:59] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[14:02:26] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[14:02:37] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:03:18] <wikibugs>	 (03PS2) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412)
[14:03:18] <wikibugs>	 (03PS4) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412)
[14:04:29] <wikibugs>	 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767769 (10phaultfinder)
[14:04:33] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2247/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:04:57] <wikibugs>	 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767770 (10cmooney)
[14:05:44] <wikibugs>	 (03Merged) 10jenkins-bot: Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney)
[14:07:05] <wikibugs>	 (03CR) 10Elukey: "Hi folks! Sorry for the broad ping but better safe than sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:07:15] <herron>	 !log alert1001:~# systemctl restart prometheus-alertmanager.service 
[14:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:33] <wikibugs>	 (03CR) 10Elukey: "@Matthew: my idea would be to depool ms-fe1009, apply the change, ask Traffic to double check, we double check, and then we repool and obs" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:08:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install7001.wikimedia.org with reason: host reimage
[14:09:02] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "No op as expected, but please double check that I haven't missed anything important." [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:10:25] <vgutierrez>	 herron: that restart of alertmanager got rid of the stale alerts?
[14:10:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:11:19] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "This seems sensible to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:11:25] <herron>	 vgutierrez: yeah, looking better now
[14:11:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7001.wikimedia.org with reason: host reimage
[14:11:36] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767822 (10CDanis) >>! In T356412#9766840, @MatthewVernon wrote: > I think I have two questions: >  >   # Where is it defined what should and shouldn't g...
[14:12:48] <herron>	 !incidents
[14:12:49] <sirenbot>	 4650 (ACKED)  [10x] ProbeDown sre (probes/service eqiad)
[14:12:49] <sirenbot>	 4649 (RESOLVED)  [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[14:12:50] <wikibugs>	 (03PS1) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911)
[14:13:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P61849 and previous config saved to /var/cache/conftool/dbconfig/20240503-141341-marostegui.json
[14:14:18] <wikibugs>	 (03PS2) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911)
[14:14:51] <wikibugs>	 (03CR) 10MVernon: "I think mediawiki nodes also talk to the frontends, for uploads and so on?" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:14:54] <sukhe>	 !log sudo homer asw*magru* commit "add durum and doh hosts in magru"
[14:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:27] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[14:15:36] <herron>	 I'll manually resolve 4650 so it doesn't retrigger tomorrow
[14:15:41] <herron>	 !resolve 4650
[14:15:42] <sirenbot>	 4650 (ACKED)  [10x] ProbeDown sre (probes/service eqiad)
[14:16:32] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767843 (10MatthewVernon) OK, I think I am convinced that this should go ahead. Thanks for your patience :)
[14:16:48] * herron resolved it via the app instead
[14:16:53] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Deprecate system::role for Cassandra services [puppet] - 10https://gerrit.wikimedia.org/r/1026940
[14:18:18] <wikibugs>	 (03PS2) 10Milimetric: Update commons impact metrics  readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701)
[14:19:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: 10Btullis)
[14:19:36] <wikibugs>	 (03CR) 10Bking: [C:03+2] Update commons impact metrics  readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701) (owner: 10Milimetric)
[14:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:16] <wikibugs>	 (03CR) 10Mabualruz: [C:03+1] [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) (owner: 10Jdrewniak)
[14:22:19] <wikibugs>	 (03CR) 10Elukey: "Definitely yes, forgot to mention those. They use envoy as sidecar proxy (both bare metal and k8s) so it should be the same assumption tha" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:23:17] <Bsadowski1>	 Is here an issue with loading?
[14:23:21] <Bsadowski1>	 there*
[14:24:14] <Bsadowski1>	 nevermind it was only brief. Weird.
[14:25:47] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: add 6 codfw appservers as workers [puppet] - 10https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074)
[14:26:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7001.wikimedia.org with OS bookworm
[14:26:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install7001.wikimedia.org
[14:26:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add install7001 - jmm@cumin2002"
[14:27:26] <elukey>	 herron: o/ so the probe down errors were cleared by restarting alertmanager?
[14:28:40] <herron>	 elukey yes although I'm not sure yet what led to that state
[14:28:44] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looking good, commit matches what we currently see on swift.discovery.wmnet:" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:28:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P61850 and previous config saved to /var/cache/conftool/dbconfig/20240503-142848-marostegui.json
[14:29:07] <elukey>	 herron: super thanks, just to know how to fix it in case it happens again
[14:30:07] <wikibugs>	 (03PS1) 10Ssingh: hiera: update installserver for magru [puppet] - 10https://gerrit.wikimedia.org/r/1026944 (https://phabricator.wikimedia.org/T346722)
[14:31:26] <wikibugs>	 (03PS1) 10Ssingh: sites: update installserver for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 (https://phabricator.wikimedia.org/T346722)
[14:34:35] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "That seems a reasonable approach to me, thanks. NB I'm OOO on Monday 6th." [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:38:06] <wikibugs>	 (03CR) 10Elukey: "Super let's sync for Tuesday if you want, we can ping each other on IRC and see if we have time to do it. I'll work only in the afternoon," [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[14:39:16] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): test plugin update in secondary host
[14:39:38] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): test plugin update in secondary host (duration: 00m 22s)
[14:40:12] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:02] <wikibugs>	 (03Abandoned) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) (owner: 10Jdrewniak)
[14:43:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61851 and previous config saved to /var/cache/conftool/dbconfig/20240503-144356-marostegui.json
[14:43:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance
[14:43:59] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[14:44:02] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): update plugins to address vulnerabilities
[14:44:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance
[14:44:17] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] sites: update installserver for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh)
[14:44:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61852 and previous config saved to /var/cache/conftool/dbconfig/20240503-144419-marostegui.json
[14:44:42] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): update plugins to address vulnerabilities (duration: 00m 39s)
[14:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add install7001 - jmm@cumin2002"
[14:51:22] <wikibugs>	 06SRE, 06Traffic, 10Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768015 (10VirginiaPoundstone)
[14:52:31] <wikibugs>	 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768036 (10VirginiaPoundstone)
[14:52:41] <wikibugs>	 (03PS1) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1026949 (https://phabricator.wikimedia.org/T364013)
[14:53:04] <wikibugs>	 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768033 (10VirginiaPoundstone) Once https://phabricator.wikimedia.org/T351117 is complete, this may need a spike to check if issue persists.
[14:57:48] <wikibugs>	 (03PS1) 10Bking: elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609)
[14:58:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking)
[14:58:25] <wikibugs>	 (03CR) 10Elukey: [C:03+1] modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[14:59:38] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[14:59:41] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] New version of statds module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026555 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[14:59:43] <wikibugs>	 (03PS1) 10Elukey: amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984)
[15:00:11] <wikibugs>	 (03PS2) 10Bking: elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609)
[15:00:12] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:31] <wikibugs>	 (CR) CI reject: [V:-1] elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: Bking)
[15:00:38] <wikibugs>	 (Merged) jenkins-bot: New version of statds module [deployment-charts] - https://gerrit.wikimedia.org/r/1026555 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:00:40] <wikibugs>	 (Merged) jenkins-bot: modules: Add restrictedSecurityContext to statsd [deployment-charts] - https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:01:49] <wikibugs>	 (PS3) Bking: elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609)
[15:01:50] <wikibugs>	 (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[15:02:08] <wikibugs>	 (CR) CI reject: [V:-1] elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: Bking)
[15:07:42] <wikibugs>	 (CR) DannyS712: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) (owner: Superzerocool)
[15:08:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61853 and previous config saved to /var/cache/conftool/dbconfig/20240503-150846-marostegui.json
[15:08:50] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[15:11:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:29] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768112 (phaultfinder)
[15:14:46] <wikibugs>	 (PS4) Bking: elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609)
[15:17:46] <wikibugs>	 (PS1) Elukey: kserve-inference: add securityContext explicit config [deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978)
[15:17:51] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:18:45] <wikibugs>	 (CR) CI reject: [V:-1] kserve-inference: add securityContext explicit config [deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: Elukey)
[15:21:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P61854 and previous config saved to /var/cache/conftool/dbconfig/20240503-152354-marostegui.json
[15:23:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:26:45] <dcausse>	 !log depooled wdqs1012 (lagged)
[15:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:23] <wikibugs>	 (CR) Ebernhardson: [C:+1] elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: Bking)
[15:29:11] <wikibugs>	 (CR) Gehel: [C:+1] elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: Bking)
[15:29:43] <wikibugs>	 (CR) Bking: [C:+2] elastic: remove backend failure check [puppet] - https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: Bking)
[15:31:44] <wikibugs>	 (PS1) Cathal Mooney: Add VM BGP for esams/drmrs/magru back to YAML for now [homer/public] - https://gerrit.wikimedia.org/r/1026956 (https://phabricator.wikimedia.org/T362421)
[15:32:42] <wikibugs>	 (CR) Jforrester: [C:+1] wikifunctions: Allow prometheus to scrape metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: JMeybohm)
[15:33:21] <wikibugs>	 (CR) JHathaway: [C:+1] "awesome, nice work!" [puppet] - https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[15:33:29] <wikibugs>	 (CR) Cathal Mooney: [C:-1] "Don't merge - will remove peerings to physical servers like dns3003!" [homer/public] - https://gerrit.wikimedia.org/r/1026956 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney)
[15:33:43] <wikibugs>	 (CR) Bking: [C:+1] Remove obsolete dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1026439 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[15:33:54] <wikibugs>	 (CR) Bking: [C:+1] Remove obsolete cert [puppet] - https://gerrit.wikimedia.org/r/1026438 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[15:34:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir7002.magru.wmnet
[15:34:36] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[15:39:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P61855 and previous config saved to /var/cache/conftool/dbconfig/20240503-153901-marostegui.json
[15:39:04] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9768239 (Gehel)
[15:39:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7002.magru.wmnet - brett@cumin2002"
[15:40:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7002.magru.wmnet - brett@cumin2002"
[15:40:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir7002.magru.wmnet on all recursors
[15:40:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7002.magru.wmnet on all recursors
[15:40:23] <wikibugs>	 SRE-OnFire, Data-Platform-SRE (2024.04.15 - 2024.05.05), Discovery-Search (Current work), Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9768264 (Gehel) p:Triage→High
[15:40:33] <wikibugs>	 SRE-OnFire, Data-Platform-SRE (2024.05.06 - 2024.05.26), Discovery-Search (Current work), Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9768267 (Gehel)
[15:41:02] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9768259 (Gehel)
[15:41:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7002.magru.wmnet - brett@cumin2002"
[15:42:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7002.magru.wmnet - brett@cumin2002"
[15:48:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:54:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61856 and previous config saved to /var/cache/conftool/dbconfig/20240503-155409-marostegui.json
[15:54:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance
[15:54:13] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[15:54:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance
[15:54:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61857 and previous config saved to /var/cache/conftool/dbconfig/20240503-155432-marostegui.json
[16:00:14] <wikibugs>	 (CR) Scott French: [C:+1] kubernetes: add 6 codfw appservers as workers [puppet] - https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[16:01:10] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9768394 (CDanis) Oh, and I think magru is a win for SV as well.
[16:02:18] <wikibugs>	 (CR) Klausman: [C:+1] amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[16:02:39] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[16:04:47] <wikibugs>	 (PS1) Btullis: Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181)
[16:05:28] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:06:51] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: Btullis)
[16:07:10] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768411 (phaultfinder)
[16:12:03] <wikibugs>	 (CR) Aklapper: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: Aklapper)
[16:14:31] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Z 377 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:31] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Y 168 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:31] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Y 207 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:31] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-X 388 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:31] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-X 357 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:32] <icinga-wm>	 RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Z 368 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:32] <icinga-wm>	 RECOVERY - Host lsw1-a7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms
[16:14:33] <icinga-wm>	 RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.02 ms
[16:15:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61858 and previous config saved to /var/cache/conftool/dbconfig/20240503-161531-marostegui.json
[16:15:35] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[16:16:54] <wikibugs>	 (CR) JHathaway: [C:-1] "I think this is the correct solution after resolving one inline question." [puppet] - https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: Andrew Bogott)
[16:17:50] <wikibugs>	 (CR) JHathaway: "I think this change can be abandoned in favor of, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026682" [puppet] - https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) (owner: Andrew Bogott)
[16:18:29] <wikibugs>	 (CR) Elukey: [C:+1] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: Btullis)
[16:18:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir7002.magru.wmnet with OS bookworm
[16:19:33] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-X on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-X 585 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:33] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-Z on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-Z 312 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:33] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-Z on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-Z 299 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:33] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-Y on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-Y 278 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:33] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-Y on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-Y 248 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:34] <icinga-wm>	 RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-X on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-X 580 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:35] <icinga-wm>	 RECOVERY - Host ps1-b8-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.25 ms
[16:19:37] <icinga-wm>	 RECOVERY - Host lsw1-b8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms
[16:24:03] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768463 (Papaul) Open→Resolved a:Papaul Resolved by rebooting both switches
[16:30:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61859 and previous config saved to /var/cache/conftool/dbconfig/20240503-163039-marostegui.json
[16:34:56] <wikibugs>	 (PS1) Jsn.sherman: Add AutoModerator extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034)
[16:34:58] <wikibugs>	 (PS1) Jsn.sherman: Deploy AutoModerator to Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034)
[16:34:59] <wikibugs>	 (PS1) Jsn.sherman: Add AutoModerator extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034)
[16:35:01] <wikibugs>	 (PS1) Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034)
[16:35:57] <wikibugs>	 (PS1) Elukey: Remove golang 1.14 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1026976
[16:36:43] <wikibugs>	 (CR) CI reject: [V:-1] CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[16:37:01] <wikibugs>	 (CR) Elukey: "I don't see any production-image with depend-on golang1.14, also this doesn't remove it from the docker registry so it should be safe. Lem" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1026976 (owner: Elukey)
[16:37:07] <wikibugs>	 (CR) Dzahn: [C:+1] phabricator: increase phabricator page delay to 4m [puppet] - https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: Jelto)
[16:37:28] <wikibugs>	 (PS1) JHathaway: WIP: puppetdb: remove unused hiera entries [puppet] - https://gerrit.wikimedia.org/r/1026977
[16:39:22] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1026977 (owner: JHathaway)
[16:44:13] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7002.magru.wmnet with reason: host reimage
[16:45:07] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator: increase phabricator page delay to 4m [puppet] - https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: Jelto)
[16:45:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61860 and previous config saved to /var/cache/conftool/dbconfig/20240503-164546-marostegui.json
[16:46:59] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7002.magru.wmnet with reason: host reimage
[16:47:04] <wikibugs>	 (PS1) Hoo man: Remove Cognate virtual domain mapping b/c code [mediawiki-config] - https://gerrit.wikimedia.org/r/1026982 (https://phabricator.wikimedia.org/T348526)
[16:47:14] <wikibugs>	 (PS2) JHathaway: puppetdb: remove unused hiera entries [puppet] - https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970)
[16:48:05] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[16:49:20] <wikibugs>	 (CR) Dzahn: [C:+2] delete cert for query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) (owner: Dzahn)
[16:51:42] <wikibugs>	 (CR) Dzahn: [C:+1] "lgtm, removed from DNS in https://gerrit.wikimedia.org/r/c/operations/dns/+/884276" [puppet] - https://gerrit.wikimedia.org/r/1026797 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[17:00:14] <wikibugs>	 (PS1) Dzahn: delete civicrm-old.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1026986
[17:00:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61862 and previous config saved to /var/cache/conftool/dbconfig/20240503-170054-marostegui.json
[17:00:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[17:01:00] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[17:01:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[17:04:15] <wikibugs>	 SRE, Infrastructure-Foundations, probenet, Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318#9768653 (CDanis)
[17:04:42] <wikibugs>	 SRE, Infrastructure-Foundations, probenet: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317#9768656 (CDanis)
[17:04:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops, probenet, and 2 others: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9768659 (CDanis)
[17:07:33] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9768669 (andrea.denisse) Output of the requested commands:  ` denisse@centrallog1002:~$ sudo sgdisk -R=/dev/sdg /dev/sdh The operation has completed successfully. ` ` denisse@centrallog1002:~$  sudo sgdisk -G /dev...
[17:11:47] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[17:13:47] <denisse>	 !log Run `sudo mdadm --add /dev/md1 /dev/sdg` on `centrallog1002` - T363660
[17:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:51] <stashbot>	 T363660: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660
[17:14:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7002.magru.wmnet with OS bookworm
[17:14:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7002.magru.wmnet
[17:15:05] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9768720 (andrea.denisse) ` denisse@centrallog1002:~$ sudo mdadm --add /dev/md1 /dev/sdg mdadm: added /dev/sdg `
[17:17:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:27:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[17:27:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[17:36:19] <wikibugs>	 (CR) Jelto: [C:+2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1026949 (https://phabricator.wikimedia.org/T364013) (owner: Ahmon Dancy)
[17:45:24] <dcausse>	 !log repooling wdqs1012
[17:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:30] <wikibugs>	 (CR) Dwisehaupt: [C:+2] delete civicrm-old.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1026986 (owner: Dzahn)
[18:17:50] <wikibugs>	 (CR) Dwisehaupt: [C:+2] "Looks good. I'll check in a bit to see if there are any civi1001 references to clean up." [dns] - https://gerrit.wikimedia.org/r/1026986 (owner: Dzahn)
[18:18:59] <wikibugs>	 (CR) Dzahn: "thanks :)" [dns] - https://gerrit.wikimedia.org/r/1026986 (owner: Dzahn)
[18:19:09] <wikibugs>	 (PS2) Dzahn: delete civicrm-old.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1026986
[18:28:37] <logmsgbot>	 !log brett@cumin2002 conftool action : set/weight=1; selector: name=ncredir7001.magru.wmnet,service=nginx
[18:29:13] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7001.magru.wmnet,service=nginx
[18:29:19] <logmsgbot>	 !log brett@cumin2002 conftool action : set/weight=1; selector: name=ncredir7002.magru.wmnet,service=nginx
[18:29:25] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet,service=nginx
[18:30:42] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:31:47] <jinxer-wm>	 FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:32:30] <mutante>	 brett: ^ I think this matches "possible causes:     A service with no backends weighted/pooled"
[18:32:43] <brett>	 Hm
[18:32:46] <mutante>	 based on https://config-master.wikimedia.org/pybal/magru/  there is no service nginx in magru yet ?
[18:33:23] <brett>	 There should be
[18:33:23] <wikibugs>	 (PS1) Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[18:33:27] <mutante>	 is it supposed to be  service=ncredir or =ncredir-https ?
[18:33:59] <mutante>	 because that's where the backends show up: https://config-master.wikimedia.org/pybal/magru/ncredir
[18:34:37] <wikibugs>	 (PS2) Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[18:34:57] <wikibugs>	 (PS1) Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on trusted runners [puppet] - https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013)
[18:34:57] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: Ryan Kemper)
[18:36:06] <brett>	 Everything seems to match up with e.g. eqiad
[18:36:37] <mutante>	 I don't see any service called nginx though?
[18:36:37] <brett>	 Maybe pybal needs a restart?
[18:36:43] <wikibugs>	 (CR) Ahmon Dancy: "Sorry for the multiple changes.  I need to be better at search/replace next time around." [puppet] - https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013) (owner: Ahmon Dancy)
[18:37:03] <mutante>	 while it mentions "/pools/eqsin/ncredir/nginx/" the service is called "ncredir", no?
[18:37:33] <brett>	 oh, oh
[18:38:27] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir7002.magru.wmnet,service=nginx
[18:38:32] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir7001.magru.wmnet,service=nginx
[18:39:36] <sukhe>	 mutante: cluster ncredir, service nginx
[18:39:53] <sukhe>	     lvs:
[18:39:53] <sukhe>	       class: high-traffic1
[18:39:53] <sukhe>	       conftool:
[18:39:53] <sukhe>	         cluster: ncredir
[18:39:53] <sukhe>	         service: nginx
[18:39:57] <sukhe>	 this one basically
[18:40:12] <mutante>	 sukhe: oh, right! ack
[18:43:47] <wikibugs>	 (PS3) Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[18:43:48] <mutante>	 https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=sum%20by%20(name%2C%20instance)%20(confd_resource_healthy)%20%2F%20count%20by%20(name%2C%20instance)%20(confd_resource_healthy)%20%3C%201&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[18:43:58] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7001.magru.wmnet,service=nginx
[18:44:03] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet,service=nginx
[18:44:08] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: Ryan Kemper)
[18:44:51] <mutante>	 failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/magru/.ncredir-https1451247916' with 1 (0.027543067932128906s) [invalid]: server pool cannot be empty! 
[18:45:15] <sukhe>	 yeah I think I am going to clean the state
[18:46:03] <wikibugs>	 (CR) Jelto: [C:+2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013) (owner: Ahmon Dancy)
[18:46:04] <sukhe>	 mutante: so what happened was at one point there was nothing pooled for magru 
[18:46:35] <sukhe>	 and essentially we were waiting until we finished reimaging and pooling everything. since brett finished ncredir, we will try to clean it up
[18:46:59] <mutante>	 sukhe: nod... maybe it dislikes it just because the previous state was "pool empty"
[18:47:09] <sukhe>	 yep
[18:47:50] <sukhe>	 https://phabricator.wikimedia.org/T363924 see also on why we are getting alerted for non-magru sites
[18:49:00] <mutante>	 wow, and that ticket is 2 days old.. there is always already something :)
[18:49:03] <sukhe>	 which was a TIL for me till swfrench-wm.f filed it :)
[18:50:48] <mutante>	 when you say "clear the state", you mean something like:
[18:50:50] <mutante>	 [puppetmaster2001:/var/run/confd-template] $ rm .ncredir*
[18:50:51] <mutante>	 ?
[18:52:15] <sukhe>	 yep
[18:52:21] <sukhe>	 https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present 
[18:52:49] <mutante>	 yea, I remember that one. when deleting the .err files the monitoring cleared up
[18:52:52] <sukhe>	 and similar for .upload and .text
[18:53:04] <sukhe>	 I think I will do it now
[18:53:13] <sukhe>	 since we have done everything in magru
[19:00:42] <jinxer-wm>	 FIRING: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:01:27] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:02:04] <sukhe>	 !log cleaning up stale confd template files for magru related reimaging
[19:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:56] <wikibugs>	 06SRE, 06collaboration-services, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9768969 (10Dzahn)
[19:05:42] <jinxer-wm>	 FIRING: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:06:47] <jinxer-wm>	 RESOLVED: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:06:53] <sukhe>	 ok :)
[19:27:53] <wikibugs>	 (03CR) 10Bking: "based on PCC , it looks like we might need to add codfw as a valid site for wdqs-test cluster in the list of clusters...which is in hierad" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper)
[19:33:05] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[19:38:58] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[19:39:05] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper)
[19:47:26] <wikibugs>	 (03PS1) 10JHathaway: pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173)
[19:47:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway)
[19:48:32] <wikibugs>	 (03PS6) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920)
[19:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:56:09] <wikibugs>	 (03PS2) 10JHathaway: pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173)
[19:56:32] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper)
[19:59:52] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway)
[20:00:43] <icinga-wm>	 RECOVERY - MD RAID on centrallog1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:59:39] <wikibugs>	 (03CR) 10Bking: [C:03+1] wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper)
[21:22:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:27:31] <ryankemper>	 !log T362920 [wdqs] Depooled `wdqs2023` in preparation to switch it to a graph split host
[21:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:34] <stashbot>	 T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) - https://phabricator.wikimedia.org/T362920
[21:29:09] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper)
[21:34:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2023:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:38:10] <ryankemper>	 ^ Forgot to downtime
[21:38:44] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on wdqs2023.codfw.wmnet with reason: T362920
[21:38:47] <stashbot>	 T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) - https://phabricator.wikimedia.org/T362920
[21:38:48] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs2023.codfw.wmnet with reason: T362920
[22:01:04] <wikibugs>	 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769326 (10andrea.denisse) The resync finished.  ` sudo cat /proc/mdstat                                                      centrallog1002: Fri May  3 22:00:07 2024  Personalities : [raid10] [linear] [multipath] [...
[22:02:14] <wikibugs>	 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769327 (10andrea.denisse) Thanks to @VRiley-WMF and @Jclark-ctr for their help debugging and troubleshooting this issue, it was a hard one! ❤
[22:06:36] <wikibugs>	 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769328 (10andrea.denisse) 05Open→03Resolved
[22:20:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:23:03] <wikibugs>	 (03PS1) 10Scott French: mathoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050
[22:27:22] <wikibugs>	 (03PS2) 10Scott French: mathoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978)
[22:33:57] <wikibugs>	 (03CR) 10Scott French: "Decided to give this a try on an "easy mode" chart after our chat this morning. If you have cycles to review, that would be greatly apprec" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[22:44:36] <wikibugs>	 (03PS1) 10Dzahn: admin: create group fr-tech-devs, apply to role crm - WIP [puppet] - 10https://gerrit.wikimedia.org/r/1027052
[23:01:27] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:38:39] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1026893
[23:38:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1026893 (owner: 10TrainBranchBot)
[23:52:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors