[00:00:01] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025915 (owner: TrainBranchBot)
[00:04:29] win 46
[00:06:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T361627)', diff saved to https://phabricator.wikimedia.org/P61788 and previous config saved to /var/cache/conftool/dbconfig/20240503-000602-marostegui.json
[00:06:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[00:06:06] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:06:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[00:06:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61789 and previous config saved to /var/cache/conftool/dbconfig/20240503-000614-marostegui.json
[00:07:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:32] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:11:22] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61790 and previous config saved to /var/cache/conftool/dbconfig/20240503-001805-marostegui.json
[00:18:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:33:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P61791 and previous config saved to /var/cache/conftool/dbconfig/20240503-003313-marostegui.json
[00:48:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P61792 and previous config saved to /var/cache/conftool/dbconfig/20240503-004821-marostegui.json
[01:03:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T361627)', diff saved to https://phabricator.wikimedia.org/P61793 and previous config saved to /var/cache/conftool/dbconfig/20240503-010330-marostegui.json
[01:03:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:03:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[01:03:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:04:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply updated JDK 8 - eevans@cumin1002
[01:10:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:17:16] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[01:20:08] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7001 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[02:20:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 49545600 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:21:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:38:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:27:01] (CR) Jdlrobson: Revert "Update wgVectorClientPrefs to wgVectorAppearance" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1026630 (owner: Jdrewniak)
[03:28:48] ops-eqiad, SRE: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060 (ops-monitoring-bot) NEW
[03:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:54:27] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061 (phaultfinder) NEW
[03:55:29] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765807 (phaultfinder)
[03:59:24] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765808 (phaultfinder)
[04:00:26] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9765809 (phaultfinder)
[04:44:15] ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9765835 (Papaul) @Jclark-ctr @VRiley-WMF when the task was auto generated, it shows that disk sdg1 failed see in task description line below (F) md1 : active raid10 sdh1[4]**// sdg1[2](F)//** sdf1[1] sde1[0] Toda...
[04:46:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[04:46:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[04:49:18] (PS1) Marostegui: es1039: Not in setup anymore. [puppet] - https://gerrit.wikimedia.org/r/1026703
[04:50:48] (CR) Marostegui: [C:+2] es1039: Not in setup anymore. [puppet] - https://gerrit.wikimedia.org/r/1026703 (owner: Marostegui)
[04:58:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:59:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[05:02:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:41] (PS1) Marostegui: db1214: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026704
[05:09:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1214', diff saved to https://phabricator.wikimedia.org/P61794 and previous config saved to /var/cache/conftool/dbconfig/20240503-050947-root.json
[05:10:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:11:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1214.eqiad.wmnet with OS bookworm
[05:14:09] (CR) Marostegui: [C:+2] db1214: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026704 (owner: Marostegui)
[05:24:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[05:24:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:24:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61795 and previous config saved to /var/cache/conftool/dbconfig/20240503-052430-marostegui.json
[05:24:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:27:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[05:35:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:35:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:04] (PS1) Marostegui: Revert "db1214: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026640
[05:40:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61796 and previous config saved to /var/cache/conftool/dbconfig/20240503-054502-marostegui.json
[05:45:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:47:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1214.eqiad.wmnet with OS bookworm
[05:48:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61797 and previous config saved to /var/cache/conftool/dbconfig/20240503-054818-root.json
[05:48:21] (CR) Marostegui: [C:+2] Revert "db1214: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026640 (owner: Marostegui)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0600)
[06:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P61798 and previous config saved to /var/cache/conftool/dbconfig/20240503-060010-marostegui.json
[06:03:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61799 and previous config saved to /var/cache/conftool/dbconfig/20240503-060324-root.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P61800 and previous config saved to /var/cache/conftool/dbconfig/20240503-061517-marostegui.json
[06:18:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61801 and previous config saved to /var/cache/conftool/dbconfig/20240503-061830-root.json
[06:25:04] (PS1) Slyngshede: Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128)
[06:25:56] (CR) CI reject: [V:-1] Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[06:27:22] (PS2) Slyngshede: Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128)
[06:30:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T361627)', diff saved to https://phabricator.wikimedia.org/P61802 and previous config saved to /var/cache/conftool/dbconfig/20240503-063025-marostegui.json
[06:30:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:30:29] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:30:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61803 and previous config saved to /var/cache/conftool/dbconfig/20240503-063048-marostegui.json
[06:33:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61804 and previous config saved to /var/cache/conftool/dbconfig/20240503-063336-root.json
[06:41:34] (PS1) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067)
[06:41:38] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1025917 (https://phabricator.wikimedia.org/T364067)
[06:47:20] (PS1) Muehlenhoff: Make ganeti7002 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026715 (https://phabricator.wikimedia.org/T363978)
[06:47:34] (PS1) Jdlrobson: Enable night mode on beta cluster desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889)
[06:48:23] (CR) Muehlenhoff: [C:+2] Make ganeti7002 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026715 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[06:48:40] SRE, Commons, MediaWiki-File-management, serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#9765986 (C.Suthorn) > Loading the original file or the 800px thumb would probably be non-ideal, partic...
[06:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61805 and previous config saved to /var/cache/conftool/dbconfig/20240503-064842-root.json
[06:49:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068 (Lina_Farid_WMDE) NEW
[06:53:13] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9766013 (Lina_Farid_WMDE)
[06:55:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61806 and previous config saved to /var/cache/conftool/dbconfig/20240503-065547-marostegui.json
[06:55:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0700)
[07:03:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61807 and previous config saved to /var/cache/conftool/dbconfig/20240503-070347-root.json
[07:10:54] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:10:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61808 and previous config saved to /var/cache/conftool/dbconfig/20240503-071057-marostegui.json
[07:11:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:13:01] (PS1) Muehlenhoff: Make ganeti7004 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026720 (https://phabricator.wikimedia.org/T363978)
[07:14:53] (CR) Muehlenhoff: [C:+2] Make ganeti7004 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1026720 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[07:18:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61809 and previous config saved to /var/cache/conftool/dbconfig/20240503-071853-root.json
[07:24:51] (PS1) Marostegui: es1032: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1026722
[07:25:08] (PS2) Jdlrobson: Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889)
[07:25:25] (CR) Muehlenhoff: [C:+2] Add magru02 to netbox config [puppet] - https://gerrit.wikimedia.org/r/1026587 (https://phabricator.wikimedia.org/T363978) (owner: Muehlenhoff)
[07:26:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61810 and previous config saved to /var/cache/conftool/dbconfig/20240503-072604-marostegui.json
[07:27:15] (CR) Marostegui: [C:+2] es1032: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1026722 (owner: Marostegui)
[07:27:54] (PS1) Muehlenhoff: Add install7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026729 (https://phabricator.wikimedia.org/T364016)
[07:32:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[07:32:37] (CR) Kosta Harlan: [C:+1] "IIRC we need to pull this patch on the deployment server to avoid surprises during the next deployment windows on Monday. Once that's clar" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:32:39] (CR) Ladsgroup: [C:+2] Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:32:49] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Arnadh2011' 'User435211' # T363654
[07:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:53] T363654: Stuck global rename [119980] - https://phabricator.wikimedia.org/T363654
[07:33:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[07:33:42] (Merged) jenkins-bot: Enable night mode on beta cluster desktop for all page views [mediawiki-config] - https://gerrit.wikimedia.org/r/1026716 (https://phabricator.wikimedia.org/T354889) (owner: Jdlrobson)
[07:34:06] (CR) Muehlenhoff: [C:+2] Add install7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026729 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:35:34] (PS1) Muehlenhoff: Add bast7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026786 (https://phabricator.wikimedia.org/T364016)
[07:37:22] (CR) Muehlenhoff: [C:+2] Add bast7001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1026786 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:39:09] (PS1) Muehlenhoff: preseed: Extend globbing for bast and prometheus to cover magru [puppet] - https://gerrit.wikimedia.org/r/1026787 (https://phabricator.wikimedia.org/T364016)
[07:41:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T361627)', diff saved to https://phabricator.wikimedia.org/P61811 and previous config saved to /var/cache/conftool/dbconfig/20240503-074112-marostegui.json
[07:41:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:41:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:41:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:41:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61812 and previous config saved to /var/cache/conftool/dbconfig/20240503-074135-marostegui.json
[07:43:20] (CR) Muehlenhoff: [C:+2] preseed: Extend globbing for bast and prometheus to cover magru [puppet] - https://gerrit.wikimedia.org/r/1026787 (https://phabricator.wikimedia.org/T364016) (owner: Muehlenhoff)
[07:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:49:03] (PS1) Zabe: Initial configuration for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529)
[07:52:11] SRE, SRE-Access-Requests, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9766190 (awight)
[07:53:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast7001.wikimedia.org
[07:53:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:57:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7001.wikimedia.org - jmm@cumin2002"
[07:59:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7001.wikimedia.org - jmm@cumin2002"
[07:59:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:20] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast7001.wikimedia.org on all recursors
[07:59:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7001.wikimedia.org on all recursors
[07:59:47] ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9766206 (phaultfinder)
[08:00:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7001.wikimedia.org - jmm@cumin2002"
[08:00:26] SRE, SRE-Access-Requests, WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9766207 (Lena_WMDE) As the manager of @Lina_Farid_WMDE I approve the request.
[08:00:42] ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9766208 (phaultfinder)
[08:00:44] (PS1) Slyngshede: P:trafficserver::backend add cloudtestidm [puppet] - https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128)
[08:00:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7001.wikimedia.org - jmm@cumin2002"
[08:05:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast7001.wikimedia.org with OS bookworm
[08:06:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61813 and previous config saved to /var/cache/conftool/dbconfig/20240503-080649-marostegui.json
[08:06:52] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[08:11:54] !log installing emacs security updates
[08:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:11] (CR) Slyngshede: "There's quite a bit of code here, but some of it allows us to remove existing code in a followup patch." [software/bitu] - https://gerrit.wikimedia.org/r/1026458 (owner: Slyngshede)
[08:17:16] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:20:07] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[08:21:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P61814 and previous config saved to /var/cache/conftool/dbconfig/20240503-082156-marostegui.json
[08:24:06] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1026790 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:26:22] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[08:28:36] (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[08:28:43] (CR) Slyngshede: [C:+2] Add cloudtestidm [dns] - https://gerrit.wikimedia.org/r/1026711 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:30:13] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[08:32:05] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) (owner: Dzahn)
[08:33:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast7001.wikimedia.org with reason: host reimage
[08:34:45] (PS1) Muehlenhoff: Remove obsolete certs for ldap-corp [puppet] - https://gerrit.wikimedia.org/r/1026797 (https://phabricator.wikimedia.org/T323820)
[08:36:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7001.wikimedia.org with reason: host reimage
[08:37:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P61815 and previous config saved to /var/cache/conftool/dbconfig/20240503-083703-marostegui.json
[08:39:15] SRE, Commons, MediaWiki-File-management, serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#9766334 (Bawolff) I think if we did deliver the wrong thumbsize, it only makes sense to deliver one la...
[08:41:50] (PS1) Jelto: phabricator: increase phabricator page delay to 4m [puppet] - https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401)
[08:48:42] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: Jelto)
[08:48:45] !log restart turnilo
[08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61816 and previous config saved to /var/cache/conftool/dbconfig/20240503-085211-marostegui.json
[08:52:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:52:14] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[08:52:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7001.wikimedia.org with OS bookworm
[08:52:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7001.wikimedia.org
[08:52:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:52:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61817 and previous config saved to /var/cache/conftool/dbconfig/20240503-085234-marostegui.json
[08:56:20] (CR) Jelto: [C:+2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) (owner: Ahmon Dancy)
[08:56:34] (PS1) Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[08:56:59] (CR) CI reject: [V:-1] elasticsearch: Remove support for sslcert SSL provider [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[08:58:49] (PS1) Muehlenhoff: Stop supporting sslcert in Profile::Pki::Provider type [puppet] - https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750)
[09:02:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:59] (PS2) Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[09:03:22] (CR) CI reject: [V:-1] elasticsearch: Remove support for sslcert SSL provider [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[09:03:49] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[09:04:31] (PS3) Muehlenhoff: elasticsearch: Remove support for sslcert SSL provider [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439)
[09:07:08] (PS1) Muehlenhoff: Remove obsolete stub certs [labs/private] - https://gerrit.wikimedia.org/r/1026806
[09:09:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[09:10:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:11:48] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:17:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61818 and previous config saved to /var/cache/conftool/dbconfig/20240503-091750-marostegui.json
[09:17:53] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[09:20:46] (PS1) Ayounsi: magru: alert on Transit BGP sessions [puppet] - https://gerrit.wikimedia.org/r/1026808 (https://phabricator.wikimedia.org/T362421)
[09:22:52] (CR) Ayounsi: [C:+2] magru: alert on Transit BGP sessions [puppet] - https://gerrit.wikimedia.org/r/1026808 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi)
[09:26:03] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:26:09] RECOVERY - Host ps1-b8-codfw is UP: PING WARNING - Packet loss = 66%, RTA = 0.21 ms
[09:26:59] PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-X on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-Y on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-Y on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-X on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] PROBLEM - ps1-b8-codfw-infeed-load-tower-B-phase-Z on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:27:00] PROBLEM - ps1-b8-codfw-infeed-load-tower-A-phase-Z on ps1-b8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:31:44] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Idle - Telxius, AS12956/IPv4: Idle - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:32:32] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[09:32:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61820 and previous config saved to /var/cache/conftool/dbconfig/20240503-093257-marostegui.json
[09:40:22] (CR) TheDJ: "Scheduled this for 9th of may puppet request window."
[puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [09:48:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61821 and previous config saved to /var/cache/conftool/dbconfig/20240503-094805-marostegui.json [09:49:16] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092 (10cmooney) 03NEW p:05Triage→03Medium [09:50:59] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766638 (10cmooney) [09:55:38] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766653 (10ayounsi) Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru. [09:57:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:02:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:03:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T361627)', diff saved to https://phabricator.wikimedia.org/P61822 and previous config saved to /var/cache/conftool/dbconfig/20240503-100313-marostegui.json [10:03:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:03:16] T361627: Create cuc_agent_id, cule_agent_id and 
cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:03:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:03:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61823 and previous config saved to /var/cache/conftool/dbconfig/20240503-100335-marostegui.json [10:12:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:13:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:14:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:14:25] (03PS1) 10Btullis: Add a ceph client for the dse-k8s container storage interface [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259) [10:15:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2240/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [10:15:59] !log installing Java 17 security updates on idp-test [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:30] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 
10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 (10cmooney) 03NEW p:05Triage→03Medium [10:16:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9766721 (10cmooney) [10:16:53] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766720 (10cmooney) [10:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:23:16] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms [10:23:42] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:42] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:42] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:42] PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:42] PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:43] PROBLEM - 
ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:24:19] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:25:35] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 (10cmooney) 03NEW p:05Triage→03Medium [10:25:48] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766766 (10cmooney) [10:25:49] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766767 (10cmooney) [10:27:36] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on lsw1-a1-codfw,lsw1-a1-codfw IPv6,lsw1-a1-codfw.mgmt with reason: device being decommed and renamed, downtiming as a precaution first [10:27:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lsw1-a1-codfw,lsw1-a1-codfw IPv6,lsw1-a1-codfw.mgmt with reason: device being decommed and renamed, downtiming as a precaution first [10:28:05] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766769 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b27eb80b-98ee-43fb-8026-b02b3e00b5d4) set by cmooney@cumin1002 for 14 days, 0:00:00 on 3 host(s) and their... 
[10:28:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61825 and previous config saved to /var/cache/conftool/dbconfig/20240503-102809-marostegui.json [10:28:13] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:29:31] (03PS1) 10Cathal Mooney: Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) [10:29:40] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:30:24] (03PS1) 10Muehlenhoff: Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 [10:30:52] (03CR) 10CI reject: [V:04-1] Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff) [10:31:24] (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [10:32:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016" [10:32:06] T364016: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016 [10:33:18] (03PS3) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 [10:33:18] (03PS4) 10Majavah: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 [10:33:18] (03PS4) 10Majavah: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 [10:33:35] (03CR) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 
10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [10:33:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016" [10:34:49] (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from production [homer/public] - 10https://gerrit.wikimedia.org/r/1026823 (https://phabricator.wikimedia.org/T364097) [10:35:00] (03PS1) 10Muehlenhoff: Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016) [10:35:36] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766810 (10cmooney) Device has been removed from LibreNMS now. I also downtimed it for 2 weeks just in case I mess up the order of anything. [10:35:56] (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [10:38:08] (03PS1) 10Marostegui: db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026846 [10:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1203', diff saved to https://phabricator.wikimedia.org/P61826 and previous config saved to /var/cache/conftool/dbconfig/20240503-103814-root.json [10:38:52] (03CR) 10Marostegui: [C:03+2] db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026846 (owner: 10Marostegui) [10:39:16] (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: use runtime_description (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah) [10:39:26] (03CR) 10Majavah: [C:03+2] wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 (owner: 10Majavah) [10:39:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1203.eqiad.wmnet with OS bookworm [10:40:09] (03Merged) 10jenkins-bot: wikireplicas: add-wiki: 
Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [10:40:51] (03CR) 10Ssingh: [C:03+1] Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [10:41:18] (03Abandoned) 10Cathal Mooney: Remove lsw1-a1-codfw from production [homer/public] - 10https://gerrit.wikimedia.org/r/1026823 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [10:41:34] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9766840 (10MatthewVernon) I think I have two questions: # Where is it defined what should and shouldn't get its own intermediate? (e.g. I see cassandra has one) # Is ther... [10:41:40] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on doh7001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd} https://wikitech.wikimedia.org/wiki/Microcode [10:42:06] (03CR) 10Muehlenhoff: [C:03+2] Make bast7001 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/1026824 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [10:42:41] (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 [10:43:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61827 and previous config saved to /var/cache/conftool/dbconfig/20240503-104317-marostegui.json [10:43:37] (03Merged) 10jenkins-bot: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah) [10:43:42] (03Merged) 10jenkins-bot: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 (owner: 10Majavah) [10:44:52] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 
13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766846 (10cmooney) [10:46:06] (03CR) 10Btullis: [V:03+1 C:03+2] Add a ceph client for the dse-k8s container storage interface [puppet] - 10https://gerrit.wikimedia.org/r/1026819 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [10:46:37] (03PS1) 10Marostegui: Revert "db1203: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026835 [10:47:06] (03PS2) 10Muehlenhoff: Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 [10:47:26] (03CR) 10CI reject: [V:04-1] Druid: overlord/historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff) [10:50:12] !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host durum7002.magru.wmnet [10:50:13] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [10:50:22] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766856 (10cmooney) [10:50:58] (03PS2) 10Cathal Mooney: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) [10:51:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow7001.magru.wmnet [10:52:12] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7002.magru.wmnet - sukhe@cumin1002" [10:52:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [10:53:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7002.magru.wmnet - 
sukhe@cumin1002" [10:53:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:53:05] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache durum7002.magru.wmnet on all recursors [10:53:08] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7002.magru.wmnet on all recursors [10:53:29] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7002.magru.wmnet - sukhe@cumin1002" [10:53:59] (03PS3) 10Muehlenhoff: Druid: historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 [10:54:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7002.magru.wmnet - sukhe@cumin1002" [10:55:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [10:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow7001.magru.wmnet [10:57:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff) [10:58:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240503-105824-marostegui.json [10:58:47] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum7002.magru.wmnet with OS bookworm [10:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:00] !log jmm@cumin2002 START - 
Cookbook sre.ganeti.reboot-vm for VM ncredir7001.magru.wmnet [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240503T1100). [11:02:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir7001.magru.wmnet [11:05:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh7001.wikimedia.org [11:06:56] !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host doh7002.wikimedia.org [11:06:57] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [11:07:26] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:54] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7002.wikimedia.org - sukhe@cumin1002" [11:09:09] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! Checked existing usage and compared it against Netbox info." 
[puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [11:09:15] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on doh7001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [11:09:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh7001.wikimedia.org [11:09:42] (03PS1) 10Majavah: wikireplicas: Sanitize logging_logindex target values [puppet] - 10https://gerrit.wikimedia.org/r/1026856 (https://phabricator.wikimedia.org/T363633) [11:09:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7002.wikimedia.org - sukhe@cumin1002" [11:09:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:09:47] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache doh7002.wikimedia.org on all recursors [11:09:50] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7002.wikimedia.org on all recursors [11:09:55] 06SRE-OnFire, 10Beta-Cluster-Infrastructure, 10logspam-watch, 10Sustainability (Incident Followup): (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379#9766924 (10brennen) [11:10:18] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7002.wikimedia.org - sukhe@cumin1002" [11:10:56] (03CR) 10Marostegui: [C:03+2] Revert "db1203: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026835 (owner: 10Marostegui) [11:11:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
doh7002.wikimedia.org - sukhe@cumin1002" [11:11:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61828 and previous config saved to /var/cache/conftool/dbconfig/20240503-111129-root.json [11:11:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:41] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host doh7002.wikimedia.org with OS bookworm [11:11:53] PROBLEM - Check whether ferm is active by checking the default input chain on mw1414 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:11:59] (03PS1) 10Majavah: wikireplicas: update-views: Add filter option [cookbooks] - 10https://gerrit.wikimedia.org/r/1026857 [11:12:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum7001.magru.wmnet [11:13:10] 06SRE, 06Infrastructure-Foundations, 10netops: Adjust IBGP route-reflector spine/leaf automation to support separate client clusters - https://phabricator.wikimedia.org/T364103 (10cmooney) 03NEW p:05Triage→03Medium [11:13:20] (03CR) 10Majavah: [C:03+2] wikireplicas: Sanitize logging_logindex target values [puppet] - 10https://gerrit.wikimedia.org/r/1026856 (https://phabricator.wikimedia.org/T363633) (owner: 10Majavah) [11:13:37] (03PS2) 10JMeybohm: New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) [11:13:37] (03PS1) 10JMeybohm: New version of base.certificates module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026859 
(https://phabricator.wikimedia.org/T362310) [11:13:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T361627)', diff saved to https://phabricator.wikimedia.org/P61829 and previous config saved to /var/cache/conftool/dbconfig/20240503-111337-marostegui.json [11:13:40] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:13:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:13:40] (03PS1) 10JMeybohm: Make base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) [11:13:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:13:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:14:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:14:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61830 and previous config saved to /var/cache/conftool/dbconfig/20240503-111415-marostegui.json [11:15:14] (03CR) 10CI reject: [V:04-1] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [11:15:59] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:15:59] !log taavi@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=93) [11:16:11] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:16:43] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1203.eqiad.wmnet with OS bookworm [11:17:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum7001.magru.wmnet [11:17:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:46] (03PS3) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [11:17:46] (03CR) 10JMeybohm: "What do you have in mind here? I made the chart not very configurable on purpose currently. Any particular cases that you think need extra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [11:18:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1452:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1452 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:19:46] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:23:51] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7002.magru.wmnet with reason: host reimage [11:24:57] PROBLEM - Check whether ferm is active by checking the default input chain on parse1019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:26:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61831 and previous config saved to 
/var/cache/conftool/dbconfig/20240503-112635-root.json [11:27:00] (03CR) 10Ayounsi: [C:03+1] Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [11:27:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7002.magru.wmnet with reason: host reimage [11:27:24] (03CR) 10Ayounsi: [C:03+1] Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [11:32:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:32:44] (03PS1) 10Btullis: Make caps an optional parameter to the Ceph::Auth::ClientAuth type [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) [11:34:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2242/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: 10Btullis) [11:36:05] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] wmp-laptop-sre: Add support for magru [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1025742 (owner: 10Muehlenhoff) [11:38:20] (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1026871 [11:38:40] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7002.wikimedia.org with reason: host reimage [11:39:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61832 and 
previous config saved to /var/cache/conftool/dbconfig/20240503-113924-marostegui.json [11:39:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:41:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61833 and previous config saved to /var/cache/conftool/dbconfig/20240503-114141-root.json [11:41:53] RECOVERY - Check whether ferm is active by checking the default input chain on mw1414 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:41:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7002.wikimedia.org with reason: host reimage [11:42:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:41] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:03] (03PS1) 10Muehlenhoff: Add node20 production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) [11:44:14] !log Removing connections from ssw1-a1-codfw and ssw1-a8-codfw to lsw1-a1-codfw T364097 [11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:17] T364097: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 [11:44:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Set up Ganeti clusters in magru 
- https://phabricator.wikimedia.org/T363978#9767109 (10MoritzMuehlenhoff) 05Open→03Resolved The two clusters (magru01 and magru02) are setup and initial VMs have been created already. [11:45:00] (03CR) 10Cathal Mooney: [C:03+2] Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [11:45:09] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7002.magru.wmnet with OS bookworm [11:45:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7002.magru.wmnet [11:45:40] (03Merged) 10jenkins-bot: Remove lsw1-a1-codfw from EVPN RR cluster config [homer/public] - 10https://gerrit.wikimedia.org/r/1026847 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [11:47:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:51:44] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767123 (10cmooney) [11:53:00] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1026871 (owner: 10Muehlenhoff) [11:53:40] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:54:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61834 and previous config saved to 
/var/cache/conftool/dbconfig/20240503-115431-marostegui.json [11:54:57] RECOVERY - Check whether ferm is active by checking the default input chain on parse1019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:55:44] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove lsw1-a1-codfw phyiscal link dns - cmooney@cumin1002" [11:56:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61835 and previous config saved to /var/cache/conftool/dbconfig/20240503-115647-root.json [11:57:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove lsw1-a1-codfw phyiscal link dns - cmooney@cumin1002" [11:57:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:01:12] !log uploaded wmf-sre-laptop 0.5.10 to apt.wikimedia.org [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7002.wikimedia.org with OS bookworm [12:02:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7002.wikimedia.org [12:04:24] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767187 (10phaultfinder) [12:05:32] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767189 (10phaultfinder) [12:06:00] (03CR) 10Cathal Mooney: [C:03+2] Remove entries for lsw1-a1-codfw and private1-a1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1026821 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [12:06:33] !log removing entries for lsw1-a1-codfw switch and private1-a1-codfw vlan from puppet T364097 
[12:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:36] T364097: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 [12:07:50] ops-codfw, SRE, Infrastructure-Foundations, netops, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767210 (cmooney) [12:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61837 and previous config saved to /var/cache/conftool/dbconfig/20240503-120938-marostegui.json [12:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61838 and previous config saved to /var/cache/conftool/dbconfig/20240503-121153-root.json [12:22:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:45] SRE, SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767281 (elukey) Hi! Trying to answer inline, Chris can chime in if I miss anything and/or if I write something totally off :) >>! In T356412#9766840, @MatthewVernon wrote:... 
[12:24:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:31] PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:24:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T361627)', diff saved to https://phabricator.wikimedia.org/P61839 and previous config saved to /var/cache/conftool/dbconfig/20240503-122446-marostegui.json [12:24:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance [12:24:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:25:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance [12:25:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61840 and previous config saved to /var/cache/conftool/dbconfig/20240503-122510-marostegui.json [12:26:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61841 and previous config saved to /var/cache/conftool/dbconfig/20240503-122659-root.json [12:27:26] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:58] FIRING: [17x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:30:38] FIRING: MXQueueNoMetrics: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [12:30:38] FIRING: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [12:30:38] FIRING: [10x] ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:17] here [12:31:36] cwhite: ? 
[12:32:26] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:44] fabfur: herron ^ [12:33:54] FIRING: [21x] ProbeDown: Service t-b-pki-01:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#t-b-pki-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:54] FIRING: [9x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:37] so many alerts, I don't know where it's starting [12:35:38] FIRING: [45x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:47] probably a ganeti issue? looks quite random [12:35:50] hm [12:36:27] could be network as well [12:36:37] or new prometheus box in magru? 
[12:37:00] or alert infra [12:37:26] FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install7001.wikimedia.org [12:39:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:39:34] things are working when I test them, the graphs don't show a sign they are down but a lot of probes are failing [12:42:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61842 and previous config saved to /var/cache/conftool/dbconfig/20240503-124204-root.json [12:42:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:42:41] our team dashboard for alertmanager also has quite a lot of random alerts and linting problems [12:43:37] this is unrelated https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&var-site=All&var-deployment=mw-parsoid&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&viewPanel=63 [12:44:49] (03CR) 10Jforrester: "Should there be a -devel image too (with npm in it)?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [12:46:50] (03PS4) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [12:47:12] herron: the probes are failing and triggering a cascade of pages [12:47:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:47:26] FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1416 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:47:48] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:48:25] (03PS1) 10Muehlenhoff: squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910 [12:50:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61843 and previous config saved to /var/cache/conftool/dbconfig/20240503-125015-marostegui.json [12:50:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:50:41] hmmm network on eqiad isn't 
happy [12:50:43] https://grafana.wikimedia.org/goto/tUvFRmLSg?orgId=1 [12:51:05] PROBLEM - Check whether ferm is active by checking the default input chain on mw2430 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:51:14] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:51:39] that would explain the ferm issues? [12:51:44] it looks like an IPv6 issue (eqiad->eqiad ICMP latency is OK for IPv4 but all over the place for IPv6) [12:52:26] FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:38] topranks: is anything going on with cr1-eqiad? [12:52:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1397 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:52:55] (03PS1) 10Cathal Mooney: Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) [12:53:13] vgutierrez: I hope not [12:53:29] topranks: we have like 100 probes failure pages basicall [12:53:35] https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:01] but things are working [12:54:06] soooo [12:54:07] * topranks looking [12:54:10] (03CR) 10CI reject: [V:04-1] Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [12:54:31] RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:54:53] vgutierrez: any specifics on that ivp6 latency? [12:55:00] you looking at a graph or doing a particular test? [12:55:05] (03CR) 10Muehlenhoff: "Not sure, we didn't do this for the previous images based on nodesource debs neither (node14/node16), and I don't think nodesource ships n" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [12:55:10] topranks: grafana link I shared above [12:55:16] sry [12:55:17] topranks: https://grafana.wikimedia.org/goto/I8WfgiYSR?orgId=1 [12:55:24] I have to step out [12:55:28] (03PS2) 10Muehlenhoff: squid_exporter: Remove some outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/1026910 [12:57:26] FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw1452:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1452 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:57] it looks like 8503219001 by topranks triggered the ferm rules update fleet wide [12:59:18] ok... [12:59:32] that patch was to remove an unused subnet from puppet defs [12:59:33] aka dropping private-a1-codfw [12:59:55] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9767450 (10CDanis) magru is a clear win for: UY, CL, AR, BR, PY It's better for some but not all users in: BO, PE {F49974214} [13:00:52] topranks: yep.. 
those ranges are exposed on the hosts with ferm on /etc/ferm/conf.d/00_defs [13:00:59] I can't see anything wrong with the network, or reproduce any at a network level [13:01:26] yeah, I guess not good if a change there results in this [13:01:38] but I suppose the answer is nftables? [13:02:00] I would have thought it'd be a rolling change as hosts run the puppet agent at different times [13:02:26] FIRING: [8x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:48] the probe down alerts are listing "14 hours ago" that is a little confusing [13:02:51] this is the ferm error I'm seeing https://www.irccloud.com/pastebin/ccdNCSXy/ [13:03:04] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:03:54] vgutierrez: hmm ok, so the restart failed [13:04:27] yup, dunno if that would trigger a weird state where the default policy is still DROP and no rules are present [13:04:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:04:48] hopefully not :) [13:05:10] yeah that would definitely not be a good way for it to operate [13:05:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61844 and previous config saved to /var/cache/conftool/dbconfig/20240503-130523-marostegui.json [13:05:27] looking at that particular host, mw1397, the old private1-a1-codfw range is not in the ruleset anymore [13:05:38] so it seems to have managed to reload after that? [13:05:47] yep, on the next puppet run [13:05:50] 30 minutes later [13:06:22] we have seen some of these transient ferm reload issues before as well. 
usually on a few mw* hosts but nothing beyond them [13:06:32] yeah exactly, 30 mins later [13:06:35] hmm [13:06:51] folks one thing that I am not getting - why do we have probe downs still present in alerts.w.o? [13:07:09] * vgutierrez looking [13:07:30] they have not recovered, things seems to work but it is rather worrying [13:07:39] also they have 14 hours ago marks [13:07:58] elukey: I got the page for them half an hour ago [13:08:04] !incidents [13:08:05] 4650 (ACKED) [10x] ProbeDown sre (probes/service eqiad) [13:08:05] 4649 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [13:08:05] 4648 (RESOLVED) db1175 (paged)/MariaDB Replica SQL: s3 (paged) [13:08:05] 4647 (RESOLVED) db1189 (paged)/MariaDB Replica SQL: s3 (paged) [13:08:16] are there still hosts with ferm in state failed? [13:08:36] they are everywhere [13:08:41] i.e. we seen on mw1397 it took till next puppet run to restart the service (and presumably restore correct ruleset) [13:08:50] https://puppetboard.wikimedia.org/failures [13:08:54] just mw1416 [13:09:14] and even looks good there now [13:09:44] the k8s ones are auto-correcting via a systemd timer [13:10:22] https://phabricator.wikimedia.org/T354855 [13:10:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:10:58] that's not good [13:11:30] that's expected because of magr provisioning but we get alerted for other sites as well which we should not [13:11:35] swfrench-wmf filed a task for that [13:11:40] https://phabricator.wikimedia.org/T363924 [13:11:49] s/magr/magru [13:12:05] same story on mw1416 [13:12:12] once we reimage ncredir7002 today, I will do a cleanup of the confd state [13:12:20] ferm failed at 12:38, next puppet run at 13:08 restarted it clean [13:12:26] FIRING: [7x] 
SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:36] (03PS1) 10Slyngshede: LDAP Eventlog [software/bitu] - 10https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478) [13:13:29] elukey: so.. using the alert query link option on alerts.w.o shows a query with an empty result... https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=%28avg_over_time%28probe_success%7Bjob%3D~%22probes%2F.%2A%22%2Cmodule%3D~%22%28http%7Ctcp%29.%2A%22%7D%5B1m%5D%29+and+on+%28instance%29+service_catalog_page+%3D%3D+1%29+%2A+100+%3C+10&g0.tab=1 [13:13:39] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [13:13:42] elukey: so those could be stale alerts on klaxon? [13:14:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:15:38] vgutierrez: o/ sorry I am missing something, what is the relationship with klaxon and the alerts in https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DProbeDown ? [13:16:11] (03PS2) 10Cathal Mooney: Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) [13:16:20] elukey: brain fart.. what's the name of the UI interface running on alerts.w.o? 
[13:16:21] (CR) Ssingh: "; private1-a1-codfw (2620:0:860:105::/64)" [dns] - https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: Cathal Mooney) [13:16:28] karma [13:16:29] and why it's paging now instead of 14 hours ago [13:16:31] thanks [13:16:38] s/klaxon/karma/ :) [13:16:54] ahhh okok sorry makes sense! [13:17:05] never checked karma, lemme see [13:17:26] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:42] !incidents [13:17:43] 4650 (ACKED) [10x] ProbeDown sre (probes/service eqiad) [13:17:43] 4649 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [13:17:43] 4648 (RESOLVED) db1175 (paged)/MariaDB Replica SQL: s3 (paged) [13:17:43] 4647 (RESOLVED) db1189 (paged)/MariaDB Replica SQL: s3 (paged) [13:17:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw1416 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:20:06] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: Cathal Mooney) [13:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61845 and previous config saved to /var/cache/conftool/dbconfig/20240503-132030-marostegui.json [13:20:35] (CR) Ssingh: [C:+1] "I think it's time to admit that you enjoy messing with v6 PTRs 😊" [dns] - https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: Cathal Mooney) [13:21:03] RECOVERY - Check whether ferm is active by checking the default input chain on mw2430 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:21:11] (03CR) 10Cathal Mooney: [C:03+2] Remove include statement for old private1-a1-codfw range [dns] - 10https://gerrit.wikimedia.org/r/1026911 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [13:21:41] so far nothing weird in the karma logs [13:21:50] (TIL that the alerts' UI is called karma) [13:22:07] checking on logstash the logs for the citoid probe I can see a 503 around the time where the dashboard flags the last citoid issue [13:22:21] May 3, 2024 @ 13:13:44.170 prometheus1006 target=https://[10.2.2.19]:4003/_info msg="Received HTTP response" status_code=503 [13:22:26] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:22:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw1397 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:23:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:03] I see puppet running at around 12:17 UTC on prometheus1006 [13:24:47] (03PS4) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [13:24:58] are the logs all on promotehus1006? [13:24:59] anything against me restarting karma? [13:25:09] Amir1: yep [13:25:12] stupid q: Where can you see the logs? [13:25:16] Amir1: https://logstash.wikimedia.org/goto/978e97b1d6f9c475d4d6bc8a6065752f [13:25:24] ah thanks [13:25:34] the logstash dashboard is linked on the grafana dashboard BTW [13:25:47] oh.. 
not just 1006, 1005 as well [13:26:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002" [13:26:59] !log restart karma on alert1001 to verify if probe down alerts shown are stale [13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:31] nope, same thing [13:27:42] (03CR) 10CDanis: [C:03+1] Stop supporting sslcert in Profile::Pki::Provider type [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:27:56] for netbox it's the same issue, some 503s [13:28:28] (03CR) 10Bking: [C:03+1] elasticsearch: Remove support for sslcert SSL provider [puppet] - 10https://gerrit.wikimedia.org/r/1026803 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [13:28:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7001.wikimedia.org - jmm@cumin2002" [13:28:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:41] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install7001.wikimedia.org on all recursors [13:28:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install7001.wikimedia.org on all recursors [13:29:07] so no networking connectivity issues between the probes and the service itself but L7 errors [13:29:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002" [13:30:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7001.wikimedia.org - jmm@cumin2002" [13:30:57] 
just pinged o11y folks on -observability.. maybe they can help :) [13:31:22] vgutierrez: looking at, for instance that citoid one, the IP is announced by LVS [13:31:44] topranks: yeah... it's targetting citoid.svc.eqiad.wmnet [13:31:54] is the issue maybe LVS using v6 to the back-end and failing due to the ferm issue? [13:32:55] topranks: LVS doesn't perform v4->v6 [13:33:04] v4 VIPs have v4 real servers [13:33:15] yeah brain fart it just writes the L2 header [13:33:18] indeed [13:33:19] yep yep [13:33:30] !on-call [13:33:41] I'm seeing occasional 503 with curl -v https://citoid.discovery.wmnet:4003/_info from the prom host, maybe 1/5 tries? [13:34:31] herron: yeah.. but that shouldn't trigger a p.a.g.e, right? [13:34:48] herron: why karma is showing citoid as paging since 14h ago? [13:34:57] yes it is very confusing [13:35:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T361627)', diff saved to https://phabricator.wikimedia.org/P61846 and previous config saved to /var/cache/conftool/dbconfig/20240503-133538-marostegui.json [13:35:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:35:42] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:35:53] I have to eat something [13:35:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61847 and previous config saved to /var/cache/conftool/dbconfig/20240503-133601-marostegui.json [13:36:03] will be back [13:36:12] Amir1: talk to fabfur, he can patch your firmware [13:36:22] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, 
Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9767646 (elukey) Status: Lift Wing codfw has been migrated successfully, we are going to do eqiad on Monday 6th. [13:40:01] ops-codfw, SRE, Infrastructure-Foundations, netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767654 (cmooney) [13:41:06] ops-codfw, SRE, Infrastructure-Foundations, netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767659 (cmooney) a:Papaul @papaul I think this one is ready to be moved to rack D1 now. [13:41:31] ops-codfw, SRE, Infrastructure-Foundations, netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767661 (cmooney) [13:43:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install7001.wikimedia.org with OS bookworm [13:43:29] (PS1) Elukey: Move mw-fe1009's envoy TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [13:45:12] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2243/console" [puppet] - https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: Elukey) [13:46:09] (CR) Muehlenhoff: "Filename should be ms-fe1009, not mw-fe1009" [puppet] - https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: Elukey) [13:46:20] SRE, SRE-swift-storage, Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767687 (elukey) I have also reviewed the non-cpXXXX IPs found in netstat on ms-fe nodes, they seem all belonging to the thumbor pods, that should be u... 
[13:47:16] (03PS1) 10Cathal Mooney: Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) [13:48:26] (03CR) 10Muehlenhoff: Move mw-fe1009's envoy TLS cert to PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:49:24] (03PS2) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [13:49:46] (03CR) 10Elukey: "Yes yes PEBCAK, I was puzzled that PCC showed no changes :D" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:51:16] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2244/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:51:52] (03PS3) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) [13:53:16] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2245/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:54:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:54:38] (03CR) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [13:57:06] topranks: re https://gerrit.wikimedia.org/r/1026928 [13:57:14] I am going to be running homer shortly in case you want it to be 
merged [13:58:14] sukhe: thanks, that ones not urgent [13:58:27] it won't cause any network changes once merged anyway, just tidy-up [13:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61848 and previous config saved to /var/cache/conftool/dbconfig/20240503-135834-marostegui.json [13:58:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:58:44] ok. so I should feel free to merge this? asking because I will run homer for the durum/doh hosts! [13:59:43] sure feel free to +1 for me, it's safe anyway [14:00:39] thanks [14:00:52] (03PS1) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) [14:00:59] (03CR) 10Ssingh: [C:03+1] Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [14:02:26] (03CR) 10Cathal Mooney: [C:03+2] Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [14:02:37] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:03:18] (03PS2) 10Elukey: role::swift::proxy: simplify hiera configuration for the tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) [14:03:18] (03PS4) 10Elukey: Move ms-fe1009's envoy TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1026927 
(https://phabricator.wikimedia.org/T356412) [14:04:29] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9767769 (10phaultfinder) [14:04:33] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2247/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:04:57] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767770 (10cmooney) [14:05:44] (03Merged) 10jenkins-bot: Remove lsw1-a1-codfw from homer vars [homer/public] - 10https://gerrit.wikimedia.org/r/1026928 (https://phabricator.wikimedia.org/T364097) (owner: 10Cathal Mooney) [14:07:05] (03CR) 10Elukey: "Hi folks! Sorry for the broad ping but better safe than sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:07:15] !log alert1001:~# systemctl restart prometheus-alertmanager.service [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:33] (03CR) 10Elukey: "@Matthew: my idea would be to depool ms-fe1009, apply the change, ask Traffic to double check, we double check, and then we repool and obs" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:08:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install7001.wikimedia.org with reason: host reimage [14:09:02] (03CR) 10Elukey: [V:03+1] "No op as expected, but please double check that I haven't missed anything important." [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:10:25] herron: that restart of alertmanager got rid of the stale alerts? 
[14:10:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:11:19] (03CR) 10MVernon: [C:03+1] "This seems sensible to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1026937 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:11:25] vgutierrez: yeah, looking better now [14:11:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7001.wikimedia.org with reason: host reimage [14:11:36] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767822 (10CDanis) >>! In T356412#9766840, @MatthewVernon wrote: > I think I have two questions: > > # Where is it defined what should and shouldn't g... [14:12:48] !incidents [14:12:49] 4650 (ACKED) [10x] ProbeDown sre (probes/service eqiad) [14:12:49] 4649 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [14:12:50] (03PS1) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) [14:13:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P61849 and previous config saved to /var/cache/conftool/dbconfig/20240503-141341-marostegui.json [14:14:18] (03PS2) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) [14:14:51] (03CR) 10MVernon: "I think mediawiki nodes also talk to the frontends, for uploads and so on?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:14:54] !log sudo homer asw*magru* commit "add durum and doh hosts in magru" [14:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [14:15:36] I'll manually resolve 4650 so it doesn't retrigger tomorrow [14:15:41] !resolve 4650 [14:15:42] 4650 (ACKED) [10x] ProbeDown sre (probes/service eqiad) [14:16:32] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9767843 (10MatthewVernon) OK, I think I am convinced that this should go ahead. Thanks for your patience :) [14:16:48] * herron resolved it via the app instead [14:16:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:07] (03PS1) 10Muehlenhoff: Deprecate system::role for Cassandra services [puppet] - 10https://gerrit.wikimedia.org/r/1026940 [14:18:18] (03PS2) 10Milimetric: Update commons impact metrics readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701) [14:19:31] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1026867 (https://phabricator.wikimedia.org/T364105) (owner: 10Btullis) [14:19:36] (03CR) 10Bking: [C:03+2] Update commons impact metrics readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701) (owner: 10Milimetric) [14:20:25] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:16] (03CR) 10Mabualruz: [C:03+1] [Vector 2022] Test night mode disabled on mainpage on beta cluster. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) (owner: 10Jdrewniak) [14:22:19] (03CR) 10Elukey: "Definitely yes, forgot to mention those. They use envoy as sidecar proxy (both bare metal and k8s) so it should be the same assumption tha" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:23:17] Is here an issue with loading? [14:23:21] there* [14:24:14] nevermind it was only brief. Weird. [14:25:47] (03PS1) 10Hnowlan: kubernetes: add 6 codfw appservers as workers [puppet] - 10https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074) [14:26:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7001.wikimedia.org with OS bookworm [14:26:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install7001.wikimedia.org [14:26:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add install7001 - jmm@cumin2002" [14:27:26] herron: o/ so the probe down errors were cleared restarting alertmanager? 
[14:28:40] elukey yes although I'm not sure yet what led to that state [14:28:44] (03CR) 10Vgutierrez: [C:03+1] "looking good, commit matches what we currently see on swift.discovery.wmnet:" [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:28:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P61850 and previous config saved to /var/cache/conftool/dbconfig/20240503-142848-marostegui.json [14:29:07] herron: super thanks, just to know how to fix in case it re-happens [14:30:07] (03PS1) 10Ssingh: hiera: update installserver for magru [puppet] - 10https://gerrit.wikimedia.org/r/1026944 (https://phabricator.wikimedia.org/T346722) [14:31:26] (03PS1) 10Ssingh: sites: update installserver for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 (https://phabricator.wikimedia.org/T346722) [14:34:35] (03CR) 10MVernon: [C:03+1] "That seems a reasonable approach to me, thanks. NB I'm OOO on Monday 6th." [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:38:06] (03CR) 10Elukey: "Super let's sync for Tuesday if you want, we can ping each other on IRC and see if we have time to do it. 
I'll work only in the afternoon," [puppet] - 10https://gerrit.wikimedia.org/r/1026927 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [14:39:16] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): test plugin update in secondary host [14:39:38] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): test plugin update in secondary host (duration: 00m 22s) [14:40:12] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:02] (03Abandoned) 10Jdrewniak: [Vector 2022] Test night mode disabled on mainpage on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026939 (https://phabricator.wikimedia.org/T362911) (owner: 10Jdrewniak) [14:43:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61851 and previous config saved to /var/cache/conftool/dbconfig/20240503-144356-marostegui.json [14:43:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:43:59] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:44:02] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): update plugins to address vulnerabilities [14:44:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:44:17] (03CR) 10Ayounsi: [C:03+1] sites: update installserver for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 
(https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:44:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61852 and previous config saved to /var/cache/conftool/dbconfig/20240503-144419-marostegui.json [14:44:42] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): update plugins to address vulnerabilities (duration: 00m 39s) [14:45:25] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add install7001 - jmm@cumin2002" [14:51:22] 06SRE, 06Traffic, 10Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768015 (10VirginiaPoundstone) [14:52:31] 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768036 (10VirginiaPoundstone) [14:52:41] (03PS1) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1026949 (https://phabricator.wikimedia.org/T364013) [14:53:04] 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9768033 (10VirginiaPoundstone) Once https://phabricator.wikimedia.org/T351117 is complete, this may need a spike to check if issue persists. 
[14:57:48] (03PS1) 10Bking: elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) [14:58:09] (03CR) 10CI reject: [V:04-1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [14:58:25] (03CR) 10Elukey: [C:03+1] modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:59:38] (03CR) 10JMeybohm: [C:03+2] modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:59:41] (03CR) 10JMeybohm: [C:03+2] New version of statds module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026555 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:59:43] (03PS1) 10Elukey: amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) [15:00:11] (03PS2) 10Bking: elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) [15:00:12] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:31] (03CR) 10CI reject: [V:04-1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [15:00:38] (03Merged) 10jenkins-bot: New version of statds module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026555 (https://phabricator.wikimedia.org/T362978) (owner: 
10JMeybohm) [15:00:40] (03Merged) 10jenkins-bot: modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:01:49] (03PS3) 10Bking: elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) [15:01:50] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [15:02:08] (03CR) 10CI reject: [V:04-1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [15:07:42] (03CR) 10DannyS712: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) (owner: 10Superzerocool) [15:08:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61853 and previous config saved to /var/cache/conftool/dbconfig/20240503-150846-marostegui.json [15:08:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:11:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:29] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768112 (10phaultfinder) [15:14:46] (03PS4) 10Bking: elastic: remove backend failure check [puppet] - 
10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) [15:17:46] (03PS1) 10Elukey: kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) [15:17:51] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:18:45] (03CR) 10CI reject: [V:04-1] kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [15:21:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P61854 and previous config saved to /var/cache/conftool/dbconfig/20240503-152354-marostegui.json [15:23:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:26:45] !log depooled wdqs1012 (lagged) [15:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:23] (03CR) 10Ebernhardson: [C:03+1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [15:29:11] (03CR) 10Gehel: 
[C:03+1] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [15:29:43] (03CR) 10Bking: [C:03+2] elastic: remove backend failure check [puppet] - 10https://gerrit.wikimedia.org/r/1026950 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [15:31:44] (03PS1) 10Cathal Mooney: Add VM BGP for esams/drmrs/magru back to YAML for now [homer/public] - 10https://gerrit.wikimedia.org/r/1026956 (https://phabricator.wikimedia.org/T362421) [15:32:42] (03CR) 10Jforrester: [C:03+1] wikifunctions: Allow prometheus to scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: 10JMeybohm) [15:33:21] (03CR) 10JHathaway: [C:03+1] "awesome, nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/1026804 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:33:29] (03CR) 10Cathal Mooney: [C:04-1] "Don't merge - will remove peerings to physical servers like dns3003!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1026956 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [15:33:43] (03CR) 10Bking: [C:03+1] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1026439 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [15:33:54] (03CR) 10Bking: [C:03+1] Remove obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1026438 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [15:34:34] !log brett@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir7002.magru.wmnet [15:34:36] !log brett@cumin2002 START - Cookbook sre.dns.netbox [15:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P61855 and previous config saved to /var/cache/conftool/dbconfig/20240503-153901-marostegui.json [15:39:04] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9768239 (10Gehel) [15:39:12] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7002.magru.wmnet - brett@cumin2002" [15:40:06] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7002.magru.wmnet - brett@cumin2002" [15:40:06] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:06] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir7002.magru.wmnet on all recursors [15:40:09] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7002.magru.wmnet on all recursors [15:40:23] 06SRE-OnFire, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 03Discovery-Search (Current work), 10Sustainability (Incident Followup): Post 
incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9768264 (10Gehel) p:05Triage→03High [15:40:33] 06SRE-OnFire, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 03Discovery-Search (Current work), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9768267 (10Gehel) [15:41:02] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9768259 (10Gehel) [15:41:45] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7002.magru.wmnet - brett@cumin2002" [15:42:39] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7002.magru.wmnet - brett@cumin2002" [15:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:54:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T361627)', diff saved to https://phabricator.wikimedia.org/P61856 and previous config saved to /var/cache/conftool/dbconfig/20240503-155409-marostegui.json [15:54:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance [15:54:13] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:54:25] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance [15:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61857 and previous config saved to /var/cache/conftool/dbconfig/20240503-155432-marostegui.json [16:00:14] (03CR) 10Scott French: [C:03+1] kubernetes: add 6 codfw appservers as workers [puppet] - 10https://gerrit.wikimedia.org/r/1026941 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [16:01:10] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9768394 (10CDanis) Oh, and I think magru is a win for SV as well. [16:02:18] (03CR) 10Klausman: [C:03+1] amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:02:39] (03CR) 10Elukey: [V:03+2 C:03+2] amd/pytorch21: update ROCm drivers to 5.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026952 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:04:47] (03PS1) 10Btullis: Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) [16:05:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:06:51] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: 10Btullis) [16:07:10] 10ops-codfw, 06SRE: 
ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768411 (10phaultfinder) [16:12:03] (03CR) 10Aklapper: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [16:14:31] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Z 377 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:31] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Y 168 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:31] RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Y 207 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:31] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-X 388 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:31] RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-X 357 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:32] RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Z 368 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:32] RECOVERY - Host lsw1-a7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms [16:14:33] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.02 ms [16:15:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61858 and previous config saved to 
/var/cache/conftool/dbconfig/20240503-161531-marostegui.json [16:15:35] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:16:54] (03CR) 10JHathaway: [C:04-1] "I think this is the correct solution after resolving one inline question." [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [16:17:50] (03CR) 10JHathaway: "I think this change can be abandoned in favor of, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026682" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [16:18:29] (03CR) 10Elukey: [C:03+1] Add the wmf-java-cacerts truststore to all remaining airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: 10Btullis) [16:18:52] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir7002.magru.wmnet with OS bookworm [16:19:33] RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-X on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-X 585 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:33] RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-Z on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-Z 312 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:33] RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-Z on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-Z 299 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:33] RECOVERY - ps1-b8-codfw-infeed-load-tower-A-phase-Y on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-A-phase-Y 278 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:33] RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-Y on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-Y 248 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:34] RECOVERY - ps1-b8-codfw-infeed-load-tower-B-phase-X on ps1-b8-codfw is OK: SNMP OK - ps1-b8-codfw-infeed-load-tower-B-phase-X 580 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:35] RECOVERY - Host ps1-b8-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.25 ms [16:19:37] RECOVERY - Host lsw1-b8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [16:24:03] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364061#9768463 (10Papaul) 05Open→03Resolved a:03Papaul Resolved by rebooting both switches [16:30:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61859 and previous config saved to /var/cache/conftool/dbconfig/20240503-163039-marostegui.json [16:34:56] (03PS1) 10Jsn.sherman: Add AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) [16:34:58] (03PS1) 10Jsn.sherman: Deploy AutoModerator to Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) [16:34:59] (03PS1) 10Jsn.sherman: Add AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) [16:35:01] (03PS1) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) [16:35:57] (03PS1) 10Elukey: Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 [16:36:43] (03CR) 10CI reject: 
[V:04-1] CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [16:37:01] (03CR) 10Elukey: "I don't see any production-image with depend-on golang1.14, also this doesn't remove it from the docker registry so it should be safe. Lem" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (owner: 10Elukey) [16:37:07] (03CR) 10Dzahn: [C:03+1] phabricator: increase phabricator page delay to 4m [puppet] - 10https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: 10Jelto) [16:37:28] (03PS1) 10JHathaway: WIP: puppetdb: remove unused hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1026977 [16:39:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026977 (owner: 10JHathaway) [16:44:13] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7002.magru.wmnet with reason: host reimage [16:45:07] (03CR) 10Dzahn: [C:03+2] phabricator: increase phabricator page delay to 4m [puppet] - 10https://gerrit.wikimedia.org/r/1026801 (https://phabricator.wikimedia.org/T362401) (owner: 10Jelto) [16:45:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61860 and previous config saved to /var/cache/conftool/dbconfig/20240503-164546-marostegui.json [16:46:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7002.magru.wmnet with reason: host reimage [16:47:04] (03PS1) 10Hoo man: Remove Cognate virtual domain mapping b/c code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026982 (https://phabricator.wikimedia.org/T348526) [16:47:14] (03PS2) 10JHathaway: puppetdb: remove unused hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970) [16:48:05] (03CR) 
10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:49:20] (03CR) 10Dzahn: [C:03+2] delete cert for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [16:51:42] (03CR) 10Dzahn: [C:03+1] "lgtm, removed from DNS in https://gerrit.wikimedia.org/r/c/operations/dns/+/884276" [puppet] - 10https://gerrit.wikimedia.org/r/1026797 (https://phabricator.wikimedia.org/T323820) (owner: 10Muehlenhoff) [17:00:14] (03PS1) 10Dzahn: delete civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1026986 [17:00:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T361627)', diff saved to https://phabricator.wikimedia.org/P61862 and previous config saved to /var/cache/conftool/dbconfig/20240503-170054-marostegui.json [17:00:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [17:01:00] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:01:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [17:04:15] 06SRE, 06Infrastructure-Foundations, 10probenet, 06Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318#9768653 (10CDanis) [17:04:42] 06SRE, 06Infrastructure-Foundations, 10probenet: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317#9768656 (10CDanis) [17:04:56] 06SRE, 06Infrastructure-Foundations, 10netops, 10probenet, and 2 others: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9768659 (10CDanis) 
[17:07:33] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9768669 (10andrea.denisse) Output of the requested commands: ` denisse@centrallog1002:~$ sudo sgdisk -R=/dev/sdg /dev/sdh The operation has completed successfully. ` ` denisse@centrallog1002:~$ sudo sgdisk -G /dev... [17:11:47] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:13:47] !log Run `sudo mdadm --add /dev/md1 /dev/sdg` on `centrallog1002` - T363660 [17:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:51] T363660: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660 [17:14:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7002.magru.wmnet with OS bookworm [17:14:14] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7002.magru.wmnet [17:15:05] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9768720 (10andrea.denisse) ` denisse@centrallog1002:~$ sudo mdadm --add /dev/md1 /dev/sdg mdadm: added /dev/sdg ` [17:17:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [17:27:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [17:36:19] (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/1026949 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy) [17:45:24] !log repooling wdqs1012 [17:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:30] (03CR) 10Dwisehaupt: [C:03+2] delete civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1026986 (owner: 10Dzahn) [18:17:50] (03CR) 10Dwisehaupt: [C:03+2] "Looks good. I'll check in a bit to see if there are any civi1001 references to clean up." [dns] - 10https://gerrit.wikimedia.org/r/1026986 (owner: 10Dzahn) [18:18:59] (03CR) 10Dzahn: "thanks :)" [dns] - 10https://gerrit.wikimedia.org/r/1026986 (owner: 10Dzahn) [18:19:09] (03PS2) 10Dzahn: delete civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1026986 [18:28:37] !log brett@cumin2002 conftool action : set/weight=1; selector: name=ncredir7001.magru.wmnet,service=nginx [18:29:13] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7001.magru.wmnet,service=nginx [18:29:19] !log brett@cumin2002 conftool action : set/weight=1; selector: name=ncredir7002.magru.wmnet,service=nginx [18:29:25] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet,service=nginx [18:30:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:31:47] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:32:30] brett: ^ I think this matches "possible causes: A service with no backends weighted/pooled" [18:32:43] Hm [18:32:46] based on 
https://config-master.wikimedia.org/pybal/magru/ there is no service nginx in magru yet ? [18:33:23] There should be [18:33:23] (03PS1) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [18:33:27] is it supposed to be service=ncredir or =ncredir-https ? [18:33:59] because that's where the backends show up: https://config-master.wikimedia.org/pybal/magru/ncredir [18:34:37] (03PS2) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [18:34:57] (03PS1) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013) [18:34:57] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [18:36:06] Everything seems to match up with e.g. eqiad [18:36:37] I don't see any service called nginx though? [18:36:37] Maybe pybal needs a restart? [18:36:43] (03CR) 10Ahmon Dancy: "Sorry for the multiple changes. I need to be better search/replace next time around." [puppet] - 10https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy) [18:37:03] while it mentions "/pools/eqsin/ncredir/nginx/" the service is called "ncredir", no? [18:37:33] oh, oh [18:38:27] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir7002.magru.wmnet,service=nginx [18:38:32] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir7001.magru.wmnet,service=nginx [18:39:36] mutante: cluster ncredir, service nginx [18:39:53] lvs: [18:39:53] class: high-traffic1 [18:39:53] conftool: [18:39:53] cluster: ncredir [18:39:53] service: nginx [18:39:57] this one basically [18:40:12] sukhe: oh, right! 
ack [18:43:47] (03PS3) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [18:43:48] https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=sum%20by%20(name%2C%20instance)%20(confd_resource_healthy)%20%2F%20count%20by%20(name%2C%20instance)%20(confd_resource_healthy)%20%3C%201&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h [18:43:58] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7001.magru.wmnet,service=nginx [18:44:03] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet,service=nginx [18:44:08] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [18:44:51] failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/magru/.ncredir-https1451247916' with 1 (0.027543067932128906s) [invalid]: server pool cannot be empty! [18:45:15] yeah I think I am going to clean the state [18:46:03] (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1027002 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy) [18:46:04] mutante: so what happened was at one point there was nothing pooled for magru [18:46:35] and essentially we were waiting for us to finish reimaging and pooling everything. since brett finished ncredir, we will try to clean it up [18:46:59] sukhe: nod... maybe it dislikes it just because the previous state was "pool empty" [18:47:09] yep [18:47:50] https://phabricator.wikimedia.org/T363924 see also on why we are getting alerted for non-magru sites [18:49:00] wow, and that ticket 2 days old.. 
there is always already something :) [18:49:03] which was a TIL for me till swfrench-wm.f filed it :) [18:50:48] when you say "clear the state", you mean something like: [18:50:50] [puppetmaster2001:/var/run/confd-template] $ rm .ncredir* [18:50:51] ? [18:52:15] yep [18:52:21] https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present [18:52:49] yea, I remember that one. when deleting the .err files the monitoring cleared up [18:52:52] and similar for .upload and .text [18:53:04] I think I will do it now [18:53:13] since we have done everything in magru [19:00:42] FIRING: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:01:27] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:04] !log cleaning up stale confd template files for magru related reimaging [19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:56] 06SRE, 06collaboration-services, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9768969 (10Dzahn) [19:05:42] FIRING: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:06:47] RESOLVED: [72x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring 
- https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:06:53] ok :) [19:27:53] (03CR) 10Bking: "based on PCC , it looks like we might need to add codfw as a valid site for wdqs-test cluster in the list of clusters...which is in hierad" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [19:33:05] (03PS4) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [19:38:58] (03PS5) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [19:39:05] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [19:47:26] (03PS1) 10JHathaway: pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) [19:47:49] (03CR) 10CI reject: [V:04-1] pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [19:48:32] (03PS6) 10Ryan Kemper: wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) [19:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:56:09] (03PS2) 10JHathaway: pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) [19:56:32] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [19:59:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [20:00:43] RECOVERY - MD RAID on centrallog1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:59:39] (03CR) 10Bking: [C:03+1] wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [21:22:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:31] !log T362920 [wdqs] Depooled `wdqs2023` in preparation to switch it to a graph split host [21:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:34] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) - https://phabricator.wikimedia.org/T362920 [21:29:09] (03CR) 10Ryan Kemper: [C:03+2] wdqs: switch wdqs2023 to graph split host [puppet] - 10https://gerrit.wikimedia.org/r/1027001 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [21:34:30] FIRING: [2x] ProbeDown: Service wdqs2023:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:38:10] ^ Forgot to downtime [21:38:44] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on 
wdqs2023.codfw.wmnet with reason: T362920 [21:38:47] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) - https://phabricator.wikimedia.org/T362920 [21:38:48] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs2023.codfw.wmnet with reason: T362920 [22:01:04] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769326 (10andrea.denisse) The resync finished. ` sudo cat /proc/mdstat centrallog1002: Fri May 3 22:00:07 2024 Personalities : [raid10] [linear] [multipath] [... [22:02:14] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769327 (10andrea.denisse) Thanks to @VRiley-WMF and @Jclark-ctr for their help debugging and troubleshooting this issue, it was a hard one! ❤ [22:06:36] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9769328 (10andrea.denisse) 05Open→03Resolved [22:20:48] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:23:03] (03PS1) 10Scott French: mathoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 [22:27:22] (03PS2) 10Scott French: mathoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) [22:33:57] (03CR) 10Scott French: "Decided to give this a try on an "easy mode" chart after our chat this morning. 
If you have cycles to review, that would be greatly apprec" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [22:44:36] (03PS1) 10Dzahn: admin: create group fr-tech-devs, apply to role crm - WIP [puppet] - 10https://gerrit.wikimedia.org/r/1027052 [23:01:27] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:38:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1026893 [23:38:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1026893 (owner: 10TrainBranchBot) [23:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors