[00:11:02] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 80471328 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:12:02] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:38] PROBLEM - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:43:39] ACKNOWLEDGEMENT - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363660 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:43:43] ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660 (ops-monitoring-bot) NEW
[01:22:43] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:28:24] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:28:25] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363661 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:28:36] ops-eqiad, SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T363661 (ops-monitoring-bot) NEW
[01:42:17] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS bookworm
[02:05:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:26] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[02:17:25] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[02:20:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:25:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:38:53] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:59] PROBLEM - Disk space on prometheus2005 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 50350 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2005&var-datasource=codfw+prometheus/ops
[03:00:26] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:26] (SystemdUnitFailed) resolved: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:31] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS bookworm
[03:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:18:59] (CR) Andrew Bogott: [C:+2] wmcs VM backups: move all backups to one host, cloudbackup1003 [puppet] - https://gerrit.wikimedia.org/r/1023467 (https://phabricator.wikimedia.org/T332400) (owner: Andrew Bogott)
[03:40:54] SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9751601 (tstarling) >>! In T345334#9654752, @Ladsgroup wrote: > If we do extrapolation after 10,000th hit. The Theil-Sen extrapolation becomes more useful: >...
[04:07:59] PROBLEM - Disk space on prometheus2005 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 49635 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2005&var-datasource=codfw+prometheus/ops
[04:22:25] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P61301 and previous config saved to /var/cache/conftool/dbconfig/20240429-045851-ladsgroup.json
[04:58:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:04:49] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322187
[05:04:54] T322187: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T322187
[05:04:54] (PS1) Marostegui: db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1024997 (https://phabricator.wikimedia.org/T361548)
[05:05:06] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322187
[05:07:15] (CR) Marostegui: [C:+2] db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1024997 (https://phabricator.wikimedia.org/T361548) (owner: Marostegui)
[05:07:58] (Merged) jenkins-bot: db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1024997 (https://phabricator.wikimedia.org/T361548) (owner: Marostegui)
[05:08:37] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1024997|db-production.php: Disable writes on es5 (T361548)]]
[05:08:41] T361548: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T361548
[05:12:30] (PS1) Marostegui: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/1024725
[05:13:39] (PS1) Marostegui: mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/1024998 (https://phabricator.wikimedia.org/T361548)
[05:13:53] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61302 and previous config saved to /var/cache/conftool/dbconfig/20240429-051359-ladsgroup.json
[05:14:52] (PS1) Marostegui: wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1024999 (https://phabricator.wikimedia.org/T361548)
[05:15:26] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:22:32] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1024997|db-production.php: Disable writes on es5 (T361548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:22:36] !log marostegui@deploy1002 marostegui: Continuing with sync
[05:22:37] T361548: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T361548
[05:22:43] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:23:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1024 with weight 0 T361548', diff saved to https://phabricator.wikimedia.org/P61303 and previous config saved to /var/cache/conftool/dbconfig/20240429-052311-root.json
[05:28:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw1463 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:29:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P61304 and previous config saved to /var/cache/conftool/dbconfig/20240429-052906-ladsgroup.json
[05:29:45] PROBLEM - Check whether ferm is active by checking the default input chain on mw2310 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:30:26] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:49] PROBLEM - Check whether ferm is active by checking the default input chain on mw2427 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:33:53] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:35] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1024997|db-production.php: Disable writes on es5 (T361548)]] (duration: 26m 58s)
[05:35:39] T361548: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T361548
[05:36:14] (CR) Marostegui: [C:+2] mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/1024998 (https://phabricator.wikimedia.org/T361548) (owner: Marostegui)
[05:37:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:38:17] (PS1) Marostegui: es1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025001 (https://phabricator.wikimedia.org/T361548)
[05:40:08] !log Starting es5 eqiad failover from es1023 to es1024 T361548
[05:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1024 to es5 primary T361548', diff saved to https://phabricator.wikimedia.org/P61305 and previous config saved to /var/cache/conftool/dbconfig/20240429-054035-marostegui.json
[05:40:40] T361548: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T361548
[05:41:10] (CR) Marostegui: [C:+2] wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1024999 (https://phabricator.wikimedia.org/T361548) (owner: Marostegui)
[05:41:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1023 T361548', diff saved to https://phabricator.wikimedia.org/P61306 and previous config saved to /var/cache/conftool/dbconfig/20240429-054158-marostegui.json
[05:42:37] (CR) Marostegui: [C:+2] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/1024725 (owner: Marostegui)
[05:42:42] (CR) Marostegui: [C:+2] es1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025001 (https://phabricator.wikimedia.org/T361548) (owner: Marostegui)
[05:43:21] (Merged) jenkins-bot: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/1024725 (owner: Marostegui)
[05:43:54] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1024725|Revert "db-production.php: Disable writes on es5"]]
[05:44:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P61308 and previous config saved to /var/cache/conftool/dbconfig/20240429-054413-ladsgroup.json
[05:44:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:44:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:44:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:44:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:45:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:45:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T361627)', diff saved to https://phabricator.wikimedia.org/P61309 and previous config saved to /var/cache/conftool/dbconfig/20240429-054519-marostegui.json
[05:45:27] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:46:22] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1024725|Revert "db-production.php: Disable writes on es5"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:46:25] !log marostegui@deploy1002 marostegui: Continuing with sync
[05:48:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T361627)', diff saved to https://phabricator.wikimedia.org/P61310 and previous config saved to /var/cache/conftool/dbconfig/20240429-054850-marostegui.json
[05:52:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw2419 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:58:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1463 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:58:41] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1024725|Revert "db-production.php: Disable writes on es5"]] (duration: 14m 47s)
[05:59:45] RECOVERY - Check whether ferm is active by checking the default input chain on mw2310 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:00:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw2427 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:02:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1023.eqiad.wmnet with OS bookworm
[06:03:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P61311 and previous config saved to /var/cache/conftool/dbconfig/20240429-060358-marostegui.json
[06:04:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1212', diff saved to https://phabricator.wikimedia.org/P61312 and previous config saved to /var/cache/conftool/dbconfig/20240429-060423-root.json
[06:05:01] (PS1) Marostegui: db1212: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025002 (https://phabricator.wikimedia.org/T362134)
[06:05:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:37] (CR) Marostegui: [C:+2] db1212: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025002 (https://phabricator.wikimedia.org/T362134) (owner: Marostegui)
[06:06:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1212.eqiad.wmnet with OS bookworm
[06:14:36] !log Restart sanitarium instances in codfw T363276
[06:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:42] T363276: Prepare and check storage layer for sysop_plwiki - https://phabricator.wikimedia.org/T363276
[06:17:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1023.eqiad.wmnet with reason: host reimage
[06:19:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P61313 and previous config saved to /var/cache/conftool/dbconfig/20240429-061905-marostegui.json
[06:20:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage
[06:21:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1023.eqiad.wmnet with reason: host reimage
[06:22:13] RECOVERY - Check whether ferm is active by checking the default input chain on mw2419 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:23:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage
[06:24:32] !log Restart sanitarium instances in eqiad T363276
[06:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:38] T363276: Prepare and check storage layer for sysop_plwiki - https://phabricator.wikimedia.org/T363276
[06:27:00] (PS1) Marostegui: Revert "db1212: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025166
[06:27:08] (PS1) Marostegui: Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025167
[06:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T361627)', diff saved to https://phabricator.wikimedia.org/P61314 and previous config saved to /var/cache/conftool/dbconfig/20240429-063412-marostegui.json
[06:34:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2128.codfw.wmnet with reason: Maintenance
[06:34:18] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:34:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2128.codfw.wmnet with reason: Maintenance
[06:34:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[06:34:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[06:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T361627)', diff saved to https://phabricator.wikimedia.org/P61315 and previous config saved to /var/cache/conftool/dbconfig/20240429-063450-marostegui.json
[06:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T361627)', diff saved to https://phabricator.wikimedia.org/P61316 and previous config saved to /var/cache/conftool/dbconfig/20240429-063819-marostegui.json
[06:43:46] (CR) Marostegui: [C:+2] Revert "db1212: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025166 (owner: Marostegui)
[06:43:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1023.eqiad.wmnet with OS bookworm
[06:44:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61317 and previous config saved to /var/cache/conftool/dbconfig/20240429-064420-root.json
[06:45:05] (CR) Marostegui: [C:+2] Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025167 (owner: Marostegui)
[06:46:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1212.eqiad.wmnet with OS bookworm
[06:46:56] (PS1) Marostegui: db2159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025148 (https://phabricator.wikimedia.org/T362745)
[06:47:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P61318 and previous config saved to /var/cache/conftool/dbconfig/20240429-064717-root.json
[06:47:41] (CR) Marostegui: [C:+2] db2159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025148 (https://phabricator.wikimedia.org/T362745) (owner: Marostegui)
[06:48:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2159.codfw.wmnet with OS bookworm
[06:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61319 and previous config saved to /var/cache/conftool/dbconfig/20240429-065022-root.json
[06:53:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P61320 and previous config saved to /var/cache/conftool/dbconfig/20240429-065326-marostegui.json
[06:54:23] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2155/co" [puppet] - https://gerrit.wikimedia.org/r/1024615 (owner: Muehlenhoff)
[06:56:41] (CR) Jelto: [V:+1 C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1024615 (owner: Muehlenhoff)
[06:59:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61321 and previous config saved to /var/cache/conftool/dbconfig/20240429-065926-root.json
[07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T0700). nyaa~
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:27] indeed, nothing to do
[07:01:32] (CR) EoghanGaffney: [C:+1] Deprecate system::role for Collaboration services (batch two) [puppet] - https://gerrit.wikimedia.org/r/1024615 (owner: Muehlenhoff)
[07:03:51] (PS1) Marostegui: Revert "db2159: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025170
[07:05:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61322 and previous config saved to /var/cache/conftool/dbconfig/20240429-070527-root.json
[07:07:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2159.codfw.wmnet with reason: host reimage
[07:08:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P61323 and previous config saved to /var/cache/conftool/dbconfig/20240429-070834-marostegui.json
[07:10:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2159.codfw.wmnet with reason: host reimage
[07:13:40] !log Upgrade idm.wikimedia.org to Bitu 0.7.0
[07:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61324 and previous config saved to /var/cache/conftool/dbconfig/20240429-071431-root.json
[07:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:20:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61325 and previous config saved to /var/cache/conftool/dbconfig/20240429-072033-root.json
[07:23:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T361627)', diff saved to https://phabricator.wikimedia.org/P61326 and previous config saved to /var/cache/conftool/dbconfig/20240429-072341-marostegui.json
[07:23:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance
[07:23:47] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:23:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance
[07:24:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T361627)', diff saved to https://phabricator.wikimedia.org/P61327 and previous config saved to /var/cache/conftool/dbconfig/20240429-072404-marostegui.json
[07:24:20] (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025247 (https://phabricator.wikimedia.org/T349774)
[07:25:25] (PS1) Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1024751 (https://phabricator.wikimedia.org/T363668)
[07:26:31] (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025247 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[07:26:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2159.codfw.wmnet with OS bookworm
[07:27:26] (CR) Marostegui: [C:+2] Revert "db2159: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025170 (owner: Marostegui)
[07:27:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T361627)', diff saved to https://phabricator.wikimedia.org/P61328 and previous config saved to /var/cache/conftool/dbconfig/20240429-072731-marostegui.json
[07:27:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61329 and previous config saved to /var/cache/conftool/dbconfig/20240429-072755-root.json
[07:29:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61330 and previous config saved to /var/cache/conftool/dbconfig/20240429-072937-root.json
[07:32:47] (PS1) Slyngshede: SSH Keymanagement, fix listing on mobile. [software/bitu] - https://gerrit.wikimedia.org/r/1025270
[07:33:52] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:34:17] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:34:18] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:34:46] !log Drop machinevision tables on testcommonswiki T362229
[07:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:51] T362229: Drop MachineVision tables from beta and production - https://phabricator.wikimedia.org/T362229
[07:34:53] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:34:54] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:35:22] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:35:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61331 and previous config saved to /var/cache/conftool/dbconfig/20240429-073539-root.json
[07:35:50] (PS2) Slyngshede: SSH Keymanagement, fix listing on mobile. [software/bitu] - https://gerrit.wikimedia.org/r/1025270
[07:37:27] !log Drop machinevision tables on commonswiki T362229
[07:37:29] (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025271 (https://phabricator.wikimedia.org/T349774)
[07:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:25] (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025271 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[07:40:30] (Merged) jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025271 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[07:42:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P61332 and previous config saved to /var/cache/conftool/dbconfig/20240429-074238-marostegui.json
[07:43:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61333 and previous config saved to /var/cache/conftool/dbconfig/20240429-074301-root.json
[07:44:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61334 and previous config saved to /var/cache/conftool/dbconfig/20240429-074444-root.json
[07:47:51] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:48:11] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:48:12] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:48:51] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:48:52] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:49:19] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:50:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:43] (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025272 (https://phabricator.wikimedia.org/T219903)
[07:50:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61335 and previous config saved to /var/cache/conftool/dbconfig/20240429-075045-root.json
[07:51:33] (PS2) DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025272 (https://phabricator.wikimedia.org/T219903)
[07:52:29] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:52:33] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:52:34] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:52:37] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:52:38] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:52:41] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:53:20] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025272 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[07:54:35] (Merged) jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1025272 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[07:57:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P61336 and previous config saved to /var/cache/conftool/dbconfig/20240429-075746-marostegui.json
[07:58:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61337 and previous config saved to /var/cache/conftool/dbconfig/20240429-075806-root.json
[07:58:34] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:58:50] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:58:51] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:59:12] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:59:13] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:59:31] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:59:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61338 and previous config saved to /var/cache/conftool/dbconfig/20240429-075949-root.json
[08:00:12] !log restarting blazegraph on wdqs1019 (BlazegraphFreeAllocatorsDecreasingRapidly)
[08:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:42] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:04:45] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:04:46] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[08:04:48] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[08:04:49] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:04:52] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:05:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Repooling', diff saved to
https://phabricator.wikimedia.org/P61339 and previous config saved to /var/cache/conftool/dbconfig/20240429-080550-root.json [08:07:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223', diff saved to https://phabricator.wikimedia.org/P61340 and previous config saved to /var/cache/conftool/dbconfig/20240429-080710-root.json [08:07:43] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:08:01] (03PS1) 10Marostegui: db1223: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025275 (https://phabricator.wikimedia.org/T362134) [08:08:28] jouncebot: nowandnext [08:08:28] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [08:08:28] In 1 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1000) [08:08:54] (03CR) 10Marostegui: [C:03+2] db1223: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025275 (https://phabricator.wikimedia.org/T362134) (owner: 10Marostegui) [08:09:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1223.eqiad.wmnet with OS bookworm [08:09:19] (03PS1) 10Majavah: Fix disabling TOTP keys with scratch tokens [extensions/OATHAuth] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025174 (https://phabricator.wikimedia.org/T363548) [08:09:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025174 (https://phabricator.wikimedia.org/T363548) (owner: 10Majavah) [08:11:43] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new 
host kubestagemaster2003.codfw.wmnet [08:11:44] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:12:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T361627)', diff saved to https://phabricator.wikimedia.org/P61341 and previous config saved to /var/cache/conftool/dbconfig/20240429-081254-marostegui.json [08:12:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [08:12:59] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:13:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [08:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61342 and previous config saved to /var/cache/conftool/dbconfig/20240429-081312-root.json [08:13:24] (03PS1) 10DCausse: Revert "cirrus: Shift autocomplete traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025176 [08:13:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T361627)', diff saved to https://phabricator.wikimedia.org/P61343 and previous config saved to /var/cache/conftool/dbconfig/20240429-081323-marostegui.json [08:14:31] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [08:14:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61344 and previous config saved to /var/cache/conftool/dbconfig/20240429-081455-root.json [08:15:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [08:15:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:19] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors [08:15:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors [08:15:40] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:49] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [08:16:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [08:17:36] (03Merged) 10jenkins-bot: Fix disabling TOTP keys with scratch tokens [extensions/OATHAuth] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025174 (https://phabricator.wikimedia.org/T363548) (owner: 10Majavah) [08:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T361627)', diff saved to https://phabricator.wikimedia.org/P61345 and previous config saved to /var/cache/conftool/dbconfig/20240429-081754-marostegui.json [08:17:55] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1025174|Fix disabling TOTP keys with scratch tokens (T363548)]] [08:17:59] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF 
wikis - https://phabricator.wikimedia.org/T361627 [08:18:04] T363548: Attempting to disable TOTP OATHAuth using a scratch code fails, but still consumes a scratch code - https://phabricator.wikimedia.org/T363548 [08:20:28] !log taavi@deploy1002 taavi: Backport for [[gerrit:1025174|Fix disabling TOTP keys with scratch tokens (T363548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:50] !log taavi@deploy1002 taavi: Continuing with sync [08:20:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61346 and previous config saved to /var/cache/conftool/dbconfig/20240429-082056-root.json [08:21:17] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye [08:21:28] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9752050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... 
[08:22:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [08:24:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [08:24:43] (03PS1) 10Marostegui: Revert "db1223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025177 [08:27:02] ooh, scap now has a progress bar for all the k8s pod replacements [08:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61347 and previous config saved to /var/cache/conftool/dbconfig/20240429-082817-root.json [08:29:30] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:38] (03PS2) 10DCausse: Revert "cirrus: Shift autocomplete traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025176 (https://phabricator.wikimedia.org/T363516) [08:33:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P61348 and previous config saved to /var/cache/conftool/dbconfig/20240429-083301-marostegui.json [08:33:23] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1025174|Fix disabling TOTP keys with scratch tokens (T363548)]] (duration: 15m 27s) [08:33:28] T363548: Attempting to disable TOTP OATHAuth using a scratch code fails, but still consumes a scratch code - https://phabricator.wikimedia.org/T363548 [08:33:37] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for redis/arclamp [puppet] - 10https://gerrit.wikimedia.org/r/1024263 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:33:49] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for 
alertmanager-webhook-logger [puppet] - 10https://gerrit.wikimedia.org/r/1024288 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:34:11] (03CR) 10Filippo Giunchedi: [C:03+1] arclamp: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024630 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:34:37] (03CR) 10Filippo Giunchedi: [C:03+2] Revert temporary monitoring for scraper [puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [08:35:45] (03CR) 10Majavah: "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024347?" [puppet] - 10https://gerrit.wikimedia.org/r/1024647 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:35:53] (03CR) 10Filippo Giunchedi: [C:03+1] Add magru network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1024895 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [08:35:56] (03CR) 10Majavah: [C:03+1] cloudweb: Enable profile::auto_restarts::service for apache/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024347 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:36:42] (03CR) 10Majavah: [C:03+1] "This seems fine, although I'm hoping that the `_ovs` role will be merged back to the main `::net` role rather soon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024620 (owner: 10Muehlenhoff) [08:36:43] (03CR) 10Filippo Giunchedi: [C:03+1] alertmanager: irc: clarify count and move firing to beginning [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [08:37:06] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for Benthos instances [puppet] - 10https://gerrit.wikimedia.org/r/1023883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:37:32] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [08:40:07] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9752079 (10fgiunchedi) dmesg ` [21683262.744660] sd 8:0:0:0: [sdg] tag#5 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s [21683262.744665] sd 8:0:0:0: [sdg] tag#5 CDB: Read(10) 28 00 00 00 00... [08:40:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [08:41:29] (03CR) 10Ayounsi: [C:03+2] Add magru network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1024895 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [08:41:45] (03CR) 10Filippo Giunchedi: [C:03+1] istio_slos: add secondary recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [08:42:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling 
@ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61349 and previous config saved to /var/cache/conftool/dbconfig/20240429-084323-root.json [08:44:04] (03CR) 10Marostegui: [C:03+2] Revert "db1223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025177 (owner: 10Marostegui) [08:44:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61350 and previous config saved to /var/cache/conftool/dbconfig/20240429-084447-root.json [08:45:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1223.eqiad.wmnet with OS bookworm [08:47:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:47:55] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1024752 (https://phabricator.wikimedia.org/T363672) [08:47:59] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024753 (https://phabricator.wikimedia.org/T363672) [08:48:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P61351 and previous config saved to /var/cache/conftool/dbconfig/20240429-084808-marostegui.json [08:50:02] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:51:04] PROBLEM - Router interfaces on cr2-magru is CRITICAL: CRITICAL: host 195.200.68.129, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:54:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bullseye [08:54:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster2003.codfw.wmnet [08:54:14] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9752129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster20... [08:55:12] (03CR) 10JMeybohm: [C:03+2] Kubernetes: Drop unused etcd_srv_name [puppet] - 10https://gerrit.wikimedia.org/r/1024406 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:55:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:55:16] (03CR) 10JMeybohm: [V:03+1 C:03+2] Disable boostrap mode on all k8s etcd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024395 (owner: 10JMeybohm) [08:56:54] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:06] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:11] (03CR) 10JMeybohm: [C:03+1] aptrepo: Add new repository component and repo sync config for Node 20 [puppet] - 10https://gerrit.wikimedia.org/r/1024663 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [08:57:12] (03PS1) 10Ladsgroup: rdbms: Protect 
against stale cache in LB::getMaxLag() [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025178 (https://phabricator.wikimedia.org/T361824) [08:57:22] jouncebot: nowandnext [08:57:22] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [08:57:23] In 1 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1000) [08:57:32] (03CR) 10Ladsgroup: [C:03+2] rdbms: Protect against stale cache in LB::getMaxLag() [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025178 (https://phabricator.wikimedia.org/T361824) (owner: 10Ladsgroup) [08:58:08] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:58:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61352 and previous config saved to /var/cache/conftool/dbconfig/20240429-085829-root.json [08:59:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61353 and previous config saved to /var/cache/conftool/dbconfig/20240429-085953-root.json [09:00:08] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:00:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:00:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T363668 [09:00:37] T363668: Switchover s7 master (db2121 -> db2218) - 
https://phabricator.wikimedia.org/T363668 [09:00:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T363668', diff saved to https://phabricator.wikimedia.org/P61354 and previous config saved to /var/cache/conftool/dbconfig/20240429-090046-marostegui.json [09:00:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T363668 [09:01:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:02:04] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1024751 (https://phabricator.wikimedia.org/T363668) (owner: 10Gerrit maintenance bot) [09:03:08] RECOVERY - Router interfaces on cr2-magru is OK: OK: host 195.200.68.129, interfaces up: 46, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:03:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T361627)', diff saved to https://phabricator.wikimedia.org/P61355 and previous config saved to /var/cache/conftool/dbconfig/20240429-090317-marostegui.json [09:03:18] (03PS1) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) [09:03:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:03:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:03:23] T361627: Create 
cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:03:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T361627)', diff saved to https://phabricator.wikimedia.org/P61356 and previous config saved to /var/cache/conftool/dbconfig/20240429-090329-marostegui.json [09:03:35] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain [09:04:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain [09:05:42] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9752173 (10Ladsgroup) That'd work on overall hits, as you said "sort images by popularity". That's not the case here. Front caches absorb all of the hits and c... 
[09:05:43] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 36.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:06:28] (03CR) 10CI reject: [V:04-1] kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:07:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T361627)', diff saved to https://phabricator.wikimedia.org/P61357 and previous config saved to /var/cache/conftool/dbconfig/20240429-090701-marostegui.json [09:08:34] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2158/console" [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:08:41] (03PS2) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) [09:09:23] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1024656 (owner: 10Muehlenhoff) [09:14:14] (03PS1) 10Ayounsi: mr: only allow ssh from bast hosts on production side [homer/public] - 10https://gerrit.wikimedia.org/r/1025279 (https://phabricator.wikimedia.org/T362522) [09:15:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61358 and previous config saved to 
/var/cache/conftool/dbconfig/20240429-091500-root.json [09:15:27] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9752198 (10JMeybohm) 05Open→03Resolved [09:15:44] (03Merged) 10jenkins-bot: rdbms: Protect against stale cache in LB::getMaxLag() [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1025178 (https://phabricator.wikimedia.org/T361824) (owner: 10Ladsgroup) [09:16:48] (03CR) 10Btullis: [C:03+2] Use an LVM volume for /var/lib/ceph on cephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1024695 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis) [09:18:01] (03CR) 10Btullis: [V:03+1 C:03+2] Fix the cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [09:18:13] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1025178|rdbms: Protect against stale cache in LB::getMaxLag() (T361824)]] [09:18:17] T361824: PHP Notice: Undefined offset in rdbms/loadbalancer/LoadBalancer.php - https://phabricator.wikimedia.org/T361824 [09:20:07] !log Starting s7 codfw failover from db2121 to db2218 - T363668 [09:20:11] (03PS2) 10Ayounsi: mr: only allow ssh from bast hosts on production side [homer/public] - 10https://gerrit.wikimedia.org/r/1025279 (https://phabricator.wikimedia.org/T362522) [09:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:12] T363668: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T363668 [09:20:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T363668', diff saved to https://phabricator.wikimedia.org/P61359 and previous config saved to /var/cache/conftool/dbconfig/20240429-092029-marostegui.json [09:20:44] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1025178|rdbms: Protect against stale 
cache in LB::getMaxLag() (T361824)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2121 T363668', diff saved to https://phabricator.wikimedia.org/P61360 and previous config saved to /var/cache/conftool/dbconfig/20240429-092104-root.json [09:21:15] (03CR) 10Jgiannelos: [C:03+1] wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [09:21:49] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [09:22:06] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [09:22:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 from api', diff saved to https://phabricator.wikimedia.org/P61361 and previous config saved to /var/cache/conftool/dbconfig/20240429-092213-marostegui.json [09:22:16] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1025279 (https://phabricator.wikimedia.org/T362522) (owner: 10Ayounsi) [09:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P61362 and previous config saved to /var/cache/conftool/dbconfig/20240429-092222-marostegui.json [09:22:47] (03CR) 10Ayounsi: [C:03+2] mr: only allow ssh from bast hosts on production side [homer/public] - 10https://gerrit.wikimedia.org/r/1025279 (https://phabricator.wikimedia.org/T362522) (owner: 10Ayounsi) [09:22:53] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [09:23:09] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:23:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bullseye [09:23:15] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9752240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bullseye [09:24:04] (03PS1) 10Marostegui: db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025281 (https://phabricator.wikimedia.org/T362745) [09:24:05] (03Merged) 10jenkins-bot: wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [09:24:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye [09:24:34] (03CR) 10Marostegui: [C:03+2] db2121: Disable notifications [puppet] - 
https://gerrit.wikimedia.org/r/1025281 (https://phabricator.wikimedia.org/T362745) (owner: Marostegui)
[09:25:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2121.codfw.wmnet with OS bookworm
[09:25:37] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[09:27:22] (PS1) Marostegui: Revert "db2121: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025179
[09:29:21] (CR) Btullis: [C:+2] Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[09:30:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61363 and previous config saved to /var/cache/conftool/dbconfig/20240429-093007-root.json
[09:30:11] (Merged) jenkins-bot: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[09:31:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[09:31:55] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[09:32:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw2427 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:32:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw2382 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:33:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.39% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:33:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:35:22] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[09:35:26] (PS1) Jgiannelos: wikifeeds: Bump staging to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1025282
[09:35:57] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[09:36:19] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[09:36:38] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[09:37:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P61364 and previous config saved to /var/cache/conftool/dbconfig/20240429-093729-marostegui.json
[09:38:28] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1025178|rdbms: Protect against stale cache in LB::getMaxLag() (T361824)]] (duration: 20m 15s)
[09:38:33] T361824: PHP Notice: Undefined offset in rdbms/loadbalancer/LoadBalancer.php - https://phabricator.wikimedia.org/T361824
[09:39:36] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[09:39:47] (PS8) Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531)
[09:39:53] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[09:41:19] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[09:42:07] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[09:42:39] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[09:43:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: host reimage
[09:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 36.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:43:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[09:45:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61365 and previous config saved to /var/cache/conftool/dbconfig/20240429-094512-root.json
[09:47:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2121.codfw.wmnet with reason: host reimage
[09:47:36] (PS1) Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344)
[09:48:46] !log akosiaris@deploy1002 helmfile [eqiad] START
helmfile.d/services/wikifeeds: apply
[09:49:03] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[09:52:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T361627)', diff saved to https://phabricator.wikimedia.org/P61366 and previous config saved to /var/cache/conftool/dbconfig/20240429-095237-marostegui.json
[09:52:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance
[09:52:42] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[09:52:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance
[09:53:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T361627)', diff saved to https://phabricator.wikimedia.org/P61367 and previous config saved to /var/cache/conftool/dbconfig/20240429-095259-marostegui.json
[09:53:27] (CR) Alexandros Kosiaris: [C:+2] nit: just a little clarification on comment [software] - https://gerrit.wikimedia.org/r/1023388 (owner: Fabfur)
[09:53:44] SRE, SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9752365 (DMburugu) Approved.
[09:53:57] (Merged) jenkins-bot: nit: just a little clarification on comment [software] - https://gerrit.wikimedia.org/r/1023388 (owner: Fabfur)
[09:56:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T361627)', diff saved to https://phabricator.wikimedia.org/P61368 and previous config saved to /var/cache/conftool/dbconfig/20240429-095629-marostegui.json
[09:57:19] (PS9) Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531)
[09:58:31] (PS10) Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1000)
[10:00:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:00:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61369 and previous config saved to /var/cache/conftool/dbconfig/20240429-100018-root.json
[10:02:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw2427 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:02:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw2382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:03:29] (PS1) Marostegui: db1191:
Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025287 (https://phabricator.wikimedia.org/T362745)
[10:05:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:05:44] (CR) Marostegui: [C:+2] Revert "db2121: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025179 (owner: Marostegui)
[10:06:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61370 and previous config saved to /var/cache/conftool/dbconfig/20240429-100605-root.json
[10:09:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2121.codfw.wmnet with OS bookworm
[10:11:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:11:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P61371 and previous config saved to /var/cache/conftool/dbconfig/20240429-101137-marostegui.json
[10:13:13] (PS2) Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344)
[10:14:01] (PS11) Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] -
https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531)
[10:15:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61372 and previous config saved to /var/cache/conftool/dbconfig/20240429-101525-root.json
[10:16:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:17:12] (PS2) Marostegui: db1191: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025287 (https://phabricator.wikimedia.org/T362745)
[10:17:12] (PS1) Marostegui: mariadb: Productionize es6 [puppet] - https://gerrit.wikimedia.org/r/1025289 (https://phabricator.wikimedia.org/T355424)
[10:17:59] (CR) Marostegui: [C:+2] db1191: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025287 (https://phabricator.wikimedia.org/T362745) (owner: Marostegui)
[10:18:00] (PS4) Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: JMeybohm)
[10:18:11] (CR) Marostegui: [C:+2] mariadb: Productionize es6 [puppet] - https://gerrit.wikimedia.org/r/1025289 (https://phabricator.wikimedia.org/T355424) (owner: Marostegui)
[10:19:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1191 T362745', diff saved to https://phabricator.wikimedia.org/P61373 and previous config saved to /var/cache/conftool/dbconfig/20240429-101908-marostegui.json
[10:19:23] T362745: Upgrade s7 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362745
[10:20:31] !log marostegui@cumin1002 START - Cookbook
sre.hosts.reimage for host db1191.eqiad.wmnet with OS bookworm
[10:21:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61374 and previous config saved to /var/cache/conftool/dbconfig/20240429-102111-root.json
[10:22:29] (PS1) Marostegui: valid_section.pp: Add es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025291 (https://phabricator.wikimedia.org/T355285)
[10:22:49] (CR) CI reject: [V:-1] valid_section.pp: Add es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025291 (https://phabricator.wikimedia.org/T355285) (owner: Marostegui)
[10:23:15] (PS2) Marostegui: valid_section.pp: Add es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025291 (https://phabricator.wikimedia.org/T355285)
[10:24:55] (PS1) Btullis: Fix the cephosd reimage process [puppet] - https://gerrit.wikimedia.org/r/1025292 (https://phabricator.wikimedia.org/T362993)
[10:25:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bullseye
[10:26:12] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephadm1001.eqiad.wmnet with OS bullseye
[10:26:17] SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9752473 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bullseye executed with errors: - cepha...
[10:26:22] (CR) Marostegui: [C:+2] valid_section.pp: Add es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025291 (https://phabricator.wikimedia.org/T355285) (owner: Marostegui)
[10:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P61375 and previous config saved to /var/cache/conftool/dbconfig/20240429-102644-marostegui.json
[10:26:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bookworm
[10:26:53] SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9752476 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm
[10:27:05] (CR) Jgiannelos: [C:+2] wikifeeds: Bump staging to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1025282 (owner: Jgiannelos)
[10:28:12] (Merged) jenkins-bot: wikifeeds: Bump staging to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1025282 (owner: Jgiannelos)
[10:28:34] (PS3) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307)
[10:29:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:32:50] (CR) Btullis: [C:+2] Fix the cephosd reimage process [puppet] - https://gerrit.wikimedia.org/r/1025292 (https://phabricator.wikimedia.org/T362993) (owner: Btullis)
[10:33:48] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host
cephosd1002.eqiad.wmnet with OS bullseye
[10:33:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
[10:34:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:34:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1180', diff saved to https://phabricator.wikimedia.org/P61376 and previous config saved to /var/cache/conftool/dbconfig/20240429-103436-marostegui.json
[10:35:15] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[10:35:28] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1180.eqiad.wmnet onto es1036.eqiad.wmnet
[10:35:45] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[10:36:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61377 and previous config saved to /var/cache/conftool/dbconfig/20240429-103617-root.json
[10:37:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
[10:37:19] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage
[10:39:42] (PS1) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:40:21] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage
[10:41:22] (PS1) Effie Mouzeli:
admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491)
[10:41:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T361627)', diff saved to https://phabricator.wikimedia.org/P61378 and previous config saved to /var/cache/conftool/dbconfig/20240429-104152-marostegui.json
[10:41:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance
[10:41:57] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[10:42:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance
[10:42:22] (CR) CI reject: [V:-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[10:42:35] (PS2) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:44:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance
[10:44:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance
[10:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T361627)', diff saved to https://phabricator.wikimedia.org/P61379 and previous config saved to /var/cache/conftool/dbconfig/20240429-104501-marostegui.json
[10:45:13] (PS3) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] -
https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:46:00] (PS4) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307)
[10:46:35] (PS2) Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491)
[10:47:09] (PS4) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:47:29] (CR) CI reject: [V:-1] kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[10:48:48] (PS5) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:49:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:49:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T361627)', diff saved to https://phabricator.wikimedia.org/P61380 and previous config saved to /var/cache/conftool/dbconfig/20240429-104923-marostegui.json
[10:49:29] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[10:50:02] (PS3) Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on
kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491)
[10:50:18] (PS1) Marostegui: Revert "db1191: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025184
[10:50:38] (PS1) MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621)
[10:50:58] (CR) CI reject: [V:-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[10:51:09] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage
[10:51:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61381 and previous config saved to /var/cache/conftool/dbconfig/20240429-105122-root.json
[10:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61382 and previous config saved to /var/cache/conftool/dbconfig/20240429-105212-root.json
[10:52:21] (CR) Marostegui: [C:+2] Revert "db1191: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025184 (owner: Marostegui)
[10:54:04] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage
[10:54:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephadm1001.eqiad.wmnet with OS bookworm
[10:54:15] SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9752598 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host
cephadm1001.eqiad.wmnet with OS bookworm completed: - cephadm1001 (**P...
[10:54:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:54:59] (PS1) KartikMistry: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049)
[10:56:06] (PS6) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[10:56:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bookworm
[10:56:41] (CR) CI reject: [V:-1] kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[10:57:18] (PS2) MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621)
[10:57:38] (CR) CI reject: [V:-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[11:02:02] (PS3) MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621)
[11:03:32] (PS14) Slyngshede: CloudIDM, Install Bitu for labtest [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128)
[11:03:49] !log jayme@cumin1002 START -
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Added kubestagemaster2003 - jayme@cumin1002"
[11:04:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P61383 and previous config saved to /var/cache/conftool/dbconfig/20240429-110430-marostegui.json
[11:05:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Added kubestagemaster2003 - jayme@cumin1002"
[11:06:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61384 and previous config saved to /var/cache/conftool/dbconfig/20240429-110628-root.json
[11:06:32] (PS15) Slyngshede: CloudIDM, Install Bitu for labtest [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128)
[11:07:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61385 and previous config saved to /var/cache/conftool/dbconfig/20240429-110717-root.json
[11:08:05] (PS4) MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621)
[11:08:26] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 9 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[11:08:52] (CR) Slyngshede: [C:+2] SSH Keymanagement, fix listing on mobile.
[software/bitu] - https://gerrit.wikimedia.org/r/1025270 (owner: Slyngshede)
[11:15:10] PROBLEM - Disk space on prometheus2006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 47090 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops
[11:15:15] ops-eqiad, SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T363409#9752644 (VRiley-WMF) a:VRiley-WMF
[11:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:16:48] (Merged) jenkins-bot: SSH Keymanagement, fix listing on mobile. [software/bitu] - https://gerrit.wikimedia.org/r/1025270 (owner: Slyngshede)
[11:16:55] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 9 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[11:17:36] (CR) Brouberol: [C:+1] datahub: create dse-k8s namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: Stevemunene)
[11:18:08] (CR) Brouberol: [C:+1] Remove separate charts for druid and cassandra AQS services [deployment-charts] - https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[11:18:31] ops-eqiad, SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T363409#9752645 (VRiley-WMF) Open→Resolved Checked in the back and found one of the power cables had slipped out. Reseated it, and the power came back on. Closing ticket.
[11:19:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:19:33] (CR) Btullis: cephadm: new modules, profile, roles for cephadm-based Ceph clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[11:19:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P61386 and previous config saved to /var/cache/conftool/dbconfig/20240429-111938-marostegui.json
[11:19:39] (PS7) JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307)
[11:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61387 and previous config saved to /var/cache/conftool/dbconfig/20240429-112134-root.json
[11:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61388 and previous config saved to /var/cache/conftool/dbconfig/20240429-112223-root.json
[11:22:50] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 10 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[11:23:10] (PS7) Urbanecm: Add account_conversion event streams.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T346327) (owner: 10Cyndywikime) [11:24:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:24:59] 10ops-eqiad, 06SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T363580#9752657 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This appears to be a duplicate ticket of T362841 Closing this one. [11:25:06] (03CR) 10Brouberol: [C:03+1] "Looks good! I have some suggestions that you should feel free to apply or ignore." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [11:25:22] (03PS8) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) [11:25:47] 10ops-eqiad, 06SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T363522#9752664 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This appears to be a duplicate of ticket T362841 Closing this one. 
[11:25:51] (03CR) 10Btullis: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [11:26:37] (03CR) 10JMeybohm: [C:04-1] "this removes a bunch of ingress rules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [11:28:41] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 9 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [11:28:42] (03CR) 10Btullis: [C:03+2] Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [11:29:12] (03CR) 10JMeybohm: [C:04-1] admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [11:29:17] (03Merged) 10jenkins-bot: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [11:29:19] (03CR) 10Cathal Mooney: [C:03+1] hiera: add magru to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024910 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [11:30:33] (03CR) 10Cathal Mooney: [C:03+1] "LGTM but the suggestion to test given the number of changes is wise." 
[software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024838 (owner: 10Volans) [11:31:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 34.16% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:31:57] (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:00] (03Abandoned) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [11:32:04] (03PS1) 10EoghanGaffney: apt-staging: Add dummy token for gitlab package puller [labs/private] - 10https://gerrit.wikimedia.org/r/1025327 [11:32:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bullseye [11:32:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:32:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:32:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers parse1011.eqiad.wmnet, 
parse1013.eqiad.wmnet, kubernetes1041.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1419.eqiad.wmnet, mw1433.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, mw1388.eqiad.wmnet, kubernetes1047.eqiad.wmnet, mw1395.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, mw1408.eqi [11:32:58] , mw1425.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, mw1367.eqiad.wmnet, mw1486.eqiad.wmnet, mw1458.eqiad.wmnet, parse1012.eqiad.wmnet, kubernetes1028.eqiad.wmnet, mw1464.eqiad.wmnet, parse1019.eqiad.wmnet, kubernetes1042.eqiad.wmnet, kubernetes1056.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, parse1006.eqiad.wmnet, kubernetes1039.eqiad.wmnet, mw1379.eqiad.wm [11:32:58] ernetes1026.eqiad.wmnet, mw1392.eqiad.wmnet, mw1368.eqiad.wmnet, mw1470.eqiad.wmnet, parse1014.eqiad.wmnet, parse1007.eqiad.wmnet, mw1432.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubern https://wikitech.wikimedia.org/wiki/PyBal [11:33:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers mw1492.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1479.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1488.eqiad.wmnet, mw1370.eqiad.wmnet, mw1425.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, mw1483.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1356.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1028.eqiad.wmnet, [11:33:06] tes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, parse1019.eqiad.wmnet, mw1381.eqiad.wmnet, kubernetes1056.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1472.eqiad.wmnet, parse1022.eqiad.wmnet, mw1451.eqiad.wmnet, mw1379.eqiad.wmnet, mw1491.eqiad.wmnet, kubernetes1023.eqiad.wmnet, parse1014.eqiad.wmnet, parse1007.eqiad.wmnet, mw1455.eqiad.wmnet, mw1475.eqiad.wmnet, mw1478.eqiad.wmnet, mw1349.eqiad.wmnet, mw1378.eqiad.wmnet, mw139 [11:33:06] wmnet, mw1482.eqiad.wmnet, kubernetes1040.eqiad.wmnet, mw1449.eqiad.wmnet, 
mw1461.eqiad.wmnet, parse1024.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1467.eqiad.wmnet, mw1394.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [11:33:58] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:34:01] huh [11:34:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:34:14] (03CR) 10JMeybohm: [V:03+1] kubernetes::master: Add stacked control plane option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [11:34:16] em [11:34:33] I was going to ack it [11:34:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T361627)', diff saved to https://phabricator.wikimedia.org/P61389 and previous config saved to /var/cache/conftool/dbconfig/20240429-113445-marostegui.json [11:34:51] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:35:47] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1180.eqiad.wmnet onto es1036.eqiad.wmnet [11:36:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.36% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:36:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61390 and previous config saved to /var/cache/conftool/dbconfig/20240429-113640-root.json [11:36:57] (ProbeDown) resolved: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61391 and previous config saved to /var/cache/conftool/dbconfig/20240429-113728-root.json [11:37:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:37:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:38:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61392 and previous config saved to /var/cache/conftool/dbconfig/20240429-113850-root.json [11:42:00] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:51:45] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.13% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:52:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61393 and previous config saved to /var/cache/conftool/dbconfig/20240429-115234-root.json [11:56:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.41% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:01:14] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host cephosd1002.eqiad.wmnet [12:01:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61394 and previous config saved to /var/cache/conftool/dbconfig/20240429-120159-root.json [12:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:07:02] (03CR) 10EoghanGaffney: [V:03+2 C:03+2] apt-staging: Add dummy token for gitlab package puller [labs/private] - 10https://gerrit.wikimedia.org/r/1025327 (owner: 10EoghanGaffney) [12:07:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61395 and previous config saved to /var/cache/conftool/dbconfig/20240429-120740-root.json [12:10:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1002.eqiad.wmnet [12:10:03] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024755 (https://phabricator.wikimedia.org/T363688) [12:10:30] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024756 (https://phabricator.wikimedia.org/T363689) [12:10:35] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024757 
(https://phabricator.wikimedia.org/T363689) [12:12:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:12:49] (03CR) 10Btullis: global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:14:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T363688 [12:14:17] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:14:21] T363688: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T363688 [12:14:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T363688 [12:14:50] (03CR) 10Brouberol: global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:14:53] (03PS5) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) [12:15:42] (03CR) 10Vgutierrez: [C:03+1] Revert "hiera: buffer memory limit increase for cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1024488 (owner: 10Fabfur) [12:16:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2140 with weight 0 T363688', diff saved to https://phabricator.wikimedia.org/P61396 and previous config saved to /var/cache/conftool/dbconfig/20240429-121559-arnaudb.json 
[12:17:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61397 and previous config saved to /var/cache/conftool/dbconfig/20240429-121704-root.json [12:17:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:17:19] (03CR) 10Btullis: global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:18:17] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:18:37] (03CR) 10MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:20:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:29] (03CR) 10Brouberol: global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:22:32] (03PS6) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) [12:22:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61398 and previous config saved to /var/cache/conftool/dbconfig/20240429-122246-root.json [12:23:18] (03PS1) 10Elukey: ml-services: update Docker image for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025344 [12:24:31] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:24:51] (03CR) 10Vgutierrez: purged: add PKI cert handling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [12:25:09] (03PS1) 10Btullis: Avoid divide by zero errors if ceph_disks fact is not yet available [puppet] - 10https://gerrit.wikimedia.org/r/1025345 (https://phabricator.wikimedia.org/T362993) [12:25:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:25:30] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update Docker image for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025344 (owner: 10Elukey) [12:25:49] (03CR) 10Elukey: [C:03+2] ml-services: update Docker image for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025344 (owner: 10Elukey) [12:26:06] (03CR) 10Btullis: [C:03+2] Avoid divide by zero errors if ceph_disks fact is not 
yet available [puppet] - 10https://gerrit.wikimedia.org/r/1025345 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [12:28:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:29:22] (03CR) 10Elukey: "Hi Keith! Could you expand a little what is the rationale for this change?" [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [12:30:56] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:31:07] (03PS2) 10Elukey: role::restbase::production: move Cassandra codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647) [12:31:07] (03PS2) 10Elukey: role::restbase::production: move eqiad Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) [12:31:07] (03PS2) 10Elukey: role::restbase::production: cleanup after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1024738 (https://phabricator.wikimedia.org/T352647) [12:31:08] (03PS4) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [12:31:09] (03CR) 10Alexandros Kosiaris: [C:03+1] Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [12:31:56] (03CR) 10Btullis: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:32:06] (03PS5) 10Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [12:32:11] (03CR) 10Brouberol: [C:03+2] global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:32:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61399 and previous config saved to /var/cache/conftool/dbconfig/20240429-123210-root.json [12:32:23] (03CR) 10Brouberol: [C:03+2] global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [12:33:30] headsup, I'm going to deploy admin_ng on all kubernetes clusters to create 2 new external-services (mariadb-analytics-meta and postgresql-analytics) [12:34:51] (03PS3) 10Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) [12:35:10] PROBLEM - Disk space on prometheus2006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 50511 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops [12:36:17] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[12:36:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bullseye [12:36:51] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024755 (https://phabricator.wikimedia.org/T363688) (owner: 10Gerrit maintenance bot) [12:36:54] (03PS8) 10Cyndywikime: Add account_conversion event streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 [12:37:20] !log Starting s4 codfw failover from db2179 to db2140 - T363688 [12:37:44] (03CR) 10Elukey: [C:03+2] role::restbase::production: move Cassandra codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:37:49] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:38:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:38:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2140 to s4 primary T363688', diff saved to https://phabricator.wikimedia.org/P61400 and previous config saved to /var/cache/conftool/dbconfig/20240429-123840-arnaudb.json [12:38:52] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:40:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2179 T363688', diff saved to https://phabricator.wikimedia.org/P61401 and previous config saved to /var/cache/conftool/dbconfig/20240429-124048-arnaudb.json [12:41:54] (03CR) 10Alexandros Kosiaris: [C:04-1] "Some minor comments internally, but there is a larger coding pattern I 'd rather see address. 
It becomes evident when one profile includes" [puppet] - 10https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:41:54] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:42:49] (03CR) 10Ladsgroup: [C:03+2] "it's beta cluster. It can in at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [12:43:03] (03PS6) 10Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [12:43:12] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:43:30] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:43:37] (03Merged) 10jenkins-bot: Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [12:43:54] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:43:55] (03CR) 10Ayounsi: [C:03+1] hiera: add magru to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024910 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [12:43:56] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:44:11] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[12:44:26] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2022.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002 [12:44:27] hnowlan: I rebased it on deploy1002, it'll be live in beta cluster in ten minutes or so. Have fun. [12:44:33] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:44:59] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:45:33] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:46:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2179.codfw.wmnet with reason: T362746 [12:46:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2179.codfw.wmnet with reason: T362746 [12:46:55] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:47:15] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:47:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61402 and previous config saved to /var/cache/conftool/dbconfig/20240429-124716-root.json [12:47:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2179.codfw.wmnet with OS bookworm [12:47:51] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:48:11] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:48:53] (ProbeDown) firing: (2) Service restbase2022-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:53] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[12:49:24] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 69 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:49:28] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:50:15] (03CR) 10Btullis: elasticsearch: Configure alerts for short-lived certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [12:50:26] (ProbeDown) resolved: (2) Service restbase2022-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:52:08] elukey: there's a non negligible diff for admin_ng ml-serve-{eqiad,codfw}. Can you tell me if it's safe to proceed? Thanks! [12:52:50] brouberol: o/ nope not safe, we are doing a big istio config refactoring, I'll sync admin-ng when we are ready, is it a problem? [12:53:33] (03CR) 10Ssingh: [C:03+2] hiera: add magru to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024910 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [12:53:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2022.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002 [12:53:44] no, not really. 
I've pushed some new external-services config, and I intended to deploy it everywhere, to avoid drift. That being said, it shouldn't be needed in the ml clusters, so it'll get deployed with your change when it's ready [12:54:08] PROBLEM - cassandra-c SSL 10.192.32.193:7000 on restbase2022 is CRITICAL: SSL CRITICAL - failed to verify restbase2022-c against restbase2022-c.codfw.wmnet, cassandra, restbase2022.codfw.wmnet:Certificate restbase2022-c.codfw.wmnet valid until 2024-05-27 12:36:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:54:16] PROBLEM - cassandra-b SSL 10.192.32.192:7000 on restbase2022 is CRITICAL: SSL CRITICAL - failed to verify restbase2022-b against restbase2022-b.codfw.wmnet, cassandra, restbase2022.codfw.wmnet:Certificate restbase2022-b.codfw.wmnet valid until 2024-05-27 12:36:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:54:16] PROBLEM - cassandra-a SSL 10.192.32.191:7000 on restbase2022 is CRITICAL: SSL CRITICAL - failed to verify restbase2022-a against restbase2022-a.codfw.wmnet, cassandra, restbase2022.codfw.wmnet:Certificate restbase2022-a.codfw.wmnet valid until 2024-05-27 12:36:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:54:48] brouberol: ack! 
[12:54:52] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [12:55:30] so the above alerts for restbase2022 are old nagios alerts, that are now firing because probably icinga needs to be synced [12:55:41] the instances are all up [12:56:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:57:21] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [12:58:36] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[23-35]*: Roll out PKI TLS certs - elukey@cumin1002 [12:58:43] (03PS4) 10Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) [12:59:05] there may be other alerts going to fire, I'll double check every time [12:59:25] (03PS1) 10Ssingh: magru: add DNS boxes dns700[12] [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) [12:59:38] (03CR) 10Santiago Faci: "Could you take a look at the changes regarding your suggestion?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [12:59:41] (03PS2) 10Ssingh: magru: add DNS boxes dns700[12] [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1300). [13:00:05] hnowlan and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:49] (03CR) 10Brouberol: [C:03+2] global_config: add analytics mariadb/postgresql instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [13:00:56] (03CR) 10Fabfur: [C:03+2] Revert "hiera: buffer memory limit increase for cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1024488 (owner: 10Fabfur) [13:01:00] o/ [13:01:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:01:36] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:01:54] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:58] RECOVERY - BFD status on cr1-eqiad is 
OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:02:12] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1087.eqiad.wmnet [13:02:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61403 and previous config saved to /var/cache/conftool/dbconfig/20240429-130222-root.json [13:03:06] I can deploy [13:03:17] hnowlan: will start with your change [13:03:22] (03PS3) 10Hnowlan: CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 [13:04:36] (03PS4) 10Fabfur: benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) [13:04:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2179.codfw.wmnet with reason: host reimage [13:04:52] dcausse: ty - it's just a logging change that won't really have impact until it is rolled out [13:05:14] ok, so not testable on mwdebug? 
[13:05:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [13:05:55] dcausse: sadly no [13:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:06:27] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (owner: 10Cyndywikime) [13:06:51] (03Merged) 10jenkins-bot: CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [13:06:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'fix weights', diff saved to https://phabricator.wikimedia.org/P61405 and previous config saved to /var/cache/conftool/dbconfig/20240429-130652-arnaudb.json [13:07:05] (03CR) 10Brouberol: [C:03+1] Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [13:07:06] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1020277|CommonSettings: change jobrunner xff to mw-jobrunner]] [13:07:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:07:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2179.codfw.wmnet with reason: host reimage [13:08:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1087.eqiad.wmnet [13:09:06] (03Restored) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:09:24] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 22 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:09:45] !log dcausse@deploy1002 hnowlan and dcausse: Backport for [[gerrit:1020277|CommonSettings: change jobrunner xff to mw-jobrunner]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:50] !log dcausse@deploy1002 hnowlan and dcausse: Continuing with sync [13:12:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:12:26] (03CR) 10Fabfur: [C:03+2] benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:13:03] (03PS1) 10Marostegui: es1038: Make it es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1025356 (https://phabricator.wikimedia.org/T355285) [13:13:44] (03CR) 10Marostegui: [C:03+2] es1038: Make it es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1025356 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [13:13:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1431 is CRITICAL: 
ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:14:12] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2039 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:14:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:14:58] PROBLEM - Check whether ferm is active by checking the default input chain on parse2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:15:46] (03PS1) 10Ssingh: P:cache::varnish::frontend: set esitest to absent by default [puppet] - 10https://gerrit.wikimedia.org/r/1025357 (https://phabricator.wikimedia.org/T308799) [13:16:32] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:17:09] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2171/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025357 (https://phabricator.wikimedia.org/T308799) (owner: 10Ssingh) [13:17:28] PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61406 and previous config saved to /var/cache/conftool/dbconfig/20240429-131728-root.json [13:17:41] (03CR) 10EoghanGaffney: [C:03+2] [apt-staging] Package puller updates [puppet] - 10https://gerrit.wikimedia.org/r/1021948 (owner: 10EoghanGaffney) [13:17:52] PROBLEM - Check whether ferm is active by checking the default input chain on mw2421 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:18:24] (03CR) 10Vgutierrez: benthos:haproxy_cache: field renaming moved to grok pattern (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) (owner: 10Fabfur) [13:18:27] (03PS1) 10EoghanGaffney: apt-staging: Add access token for gitlab package puller [puppet] - 10https://gerrit.wikimedia.org/r/1025358 [13:18:53] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:26] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:33] 06SRE-OnFire, 06Discovery-Search, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694 (10bking) 03NEW [13:21:08] (03CR) 10Vgutierrez: [C:03+1] "thx for taking care of this one <3" [puppet] - 
10https://gerrit.wikimedia.org/r/1025357 (https://phabricator.wikimedia.org/T308799) (owner: 10Ssingh) [13:21:58] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 97 connections established with conf2004.codfw.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [13:22:06] hmm? [13:23:16] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1020277|CommonSettings: change jobrunner xff to mw-jobrunner]] (duration: 16m 10s) [13:23:40] hnowlan: your change should be live [13:23:53] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bullseye [13:24:10] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 98 connections established with conf2004.codfw.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [13:24:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:24:22] 06SRE-OnFire, 06Discovery-Search, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9753023 (10bking) [13:24:41] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2172/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (owner: 
10EoghanGaffney) [13:24:43] shipping my patch now [13:25:04] (03CR) 10EoghanGaffney: apt-staging: Add access token for gitlab package puller [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (owner: 10EoghanGaffney) [13:25:10] sukhe: a few etcd reconnections.. I guess that the check hit lvs2014 between those [13:25:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025176 (https://phabricator.wikimedia.org/T363516) (owner: 10DCausse) [13:25:48] yeah [13:26:20] !log sudo cumin "A:cp-text" "disable-puppet 'merging CR 1025357'" [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:52] (03Merged) 10jenkins-bot: Revert "cirrus: Shift autocomplete traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025176 (https://phabricator.wikimedia.org/T363516) (owner: 10DCausse) [13:27:10] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] [13:27:11] (03CR) 10Alexandros Kosiaris: [C:04-1] "2 minor comments, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:27:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:27:21] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516 [13:28:53] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:01] (03CR) 10Brouberol: [C:03+1] "Looks good, thanks Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff) [13:29:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2179.codfw.wmnet with OS bookworm [13:29:39] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:26] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:32] (03CR) 10AOkoth: [C:03+1] Deprecate system::role for Collaboration services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1024615 (owner: 10Muehlenhoff) [13:30:47] !log dcausse@deploy1002 dcausse: Continuing with sync [13:32:26] (03PS5) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) [13:32:27] (03PS1) 10JMeybohm: etcd::v3: Allow prometheus nodes to scrape etcd [puppet] - 10https://gerrit.wikimedia.org/r/1025362 [13:32:54] (03CR) 10CI reject: [V:04-1] kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:32:55] (03PS1) 10Marostegui: mariadb: Remove comments from es6 [puppet] - 10https://gerrit.wikimedia.org/r/1025363 (https://phabricator.wikimedia.org/T355285) [13:33:35] jouncebot: nowandnext [13:33:35] For the next 0 hour(s) and 26 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1300) [13:33:35] In 1 hour(s) and 56 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1530) [13:33:47] (03CR) 10Reedy: [C:03+2] Fix for encoded characters in resource attribute [extensions/TimedMediaHandler] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1024714 (https://phabricator.wikimedia.org/T363550) (owner: 10Jforrester) [13:33:53] (ProbeDown) firing: (66) Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:33:59] (03CR) 10Marostegui: [C:03+2] mariadb: Remove comments from es6 [puppet] - 10https://gerrit.wikimedia.org/r/1025363 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [13:36:20] (03Abandoned) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025295 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:36:29] PROBLEM - Check whether ferm is active by checking the default input chain on mw1478 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:36:44] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cache::varnish::frontend: set esitest to absent by default [puppet] - 10https://gerrit.wikimedia.org/r/1025357 (https://phabricator.wikimedia.org/T308799) (owner: 10Ssingh) [13:37:08] (03PS4) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on 
kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [13:37:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61407 and previous config saved to /var/cache/conftool/dbconfig/20240429-133736-arnaudb.json [13:37:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [13:37:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [13:38:53] (ProbeDown) firing: (64) Service restbase2025-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:11] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:39:55] PROBLEM - Check whether ferm is active by checking the default input chain on parse1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:40:02] !log sudo cumin -b1 -s10 "A:cp-text" "run-puppet-agent --enable 'merging CR 1025357'" [13:40:04] (03CR) 10CI reject: [V:04-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:40:05] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:14] (03PS5) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [13:40:17] PROBLEM - Check whether ferm is active by checking the default input chain on parse1024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:40:18] (03PS6) 10JMeybohm: kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) [13:40:23] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025362 (owner: 10JMeybohm) [13:40:26] (ProbeDown) firing: (63) Service restbase2025-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:40:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:40:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:41:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:41:14] belated thanks dcausse! 
[13:41:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61408 and previous config saved to /var/cache/conftool/dbconfig/20240429-134115-marostegui.json [13:41:24] (03CR) 10Stevemunene: [C:03+1] Harmonise analytics Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff) [13:41:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:42:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:43:13] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] (duration: 16m 02s) [13:43:19] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516 [13:43:29] (03CR) 10CI reject: [V:04-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:43:47] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9753177 (10ssingh) [13:43:49] 10ops-magru, 13Patch-For-Review: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9753178 (10ssingh) [13:43:57] RECOVERY - Check whether ferm is active by checking the default input chain on mw1431 
is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:44:13] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2039 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:44:34] (03PS2) 10Fabfur: benthos:haproxy_cache: field renaming moved to grok pattern [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) [13:44:56] Reedy: you'll take care of the deploy of https://gerrit.wikimedia.org/r/1024714 [13:44:58] ? [13:44:59] RECOVERY - Check whether ferm is active by checking the default input chain on parse2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:45:02] dcausse: Yeah [13:45:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61409 and previous config saved to /var/cache/conftool/dbconfig/20240429-134507-marostegui.json [13:45:14] ok, then I'm done :) [13:46:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:46:31] (03PS6) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [13:46:33] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:46:58] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 6): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [13:47:14] (03CR) 10Fabfur: benthos:haproxy_cache: field renaming moved to grok pattern (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) (owner: 10Fabfur) [13:47:29] RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:47:36] (03PS2) 10Majavah: hieradata: move cloudvirt2002-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1023384 (https://phabricator.wikimedia.org/T358761) [13:47:38] (03PS1) 10Btullis: Fix the ceph osd activate command for hdd devices [puppet] - 10https://gerrit.wikimedia.org/r/1025365 (https://phabricator.wikimedia.org/T362993) [13:47:53] RECOVERY - Check whether ferm is active by checking the default input chain on mw2421 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:48:25] (03CR) 10Majavah: [C:03+2] hieradata: move cloudvirt2002-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1023384 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [13:48:53] (ProbeDown) firing: (57) Service restbase2026-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:39] (03CR) 10CI reject: [V:04-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:49:41] (03CR) 10Btullis: [C:03+2] Fix the ceph osd activate command for hdd devices 
[puppet] - 10https://gerrit.wikimedia.org/r/1025365 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [13:50:53] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bookworm [13:51:26] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [13:51:38] (03PS1) 10Ssingh: geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) [13:52:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61410 and previous config saved to /var/cache/conftool/dbconfig/20240429-135241-arnaudb.json [13:53:42] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bullseye [13:54:07] (03CR) 10Alexandros Kosiaris: [C:03+1] aptrepo: Add new repository component and repo sync config for Node 20 [puppet] - 10https://gerrit.wikimedia.org/r/1024663 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [13:55:26] (ProbeDown) firing: (54) Service restbase2027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:50] (03Merged) 10jenkins-bot: Fix for encoded characters in resource attribute [extensions/TimedMediaHandler] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1024714 (https://phabricator.wikimedia.org/T363550) (owner: 10Jforrester) [13:58:53] (ProbeDown) firing: (49) Service restbase2027-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:54] (03CR) 
10Herron: "Sure, my thinking is since we changed the thanos rule label configuration somewhat significantly (site/replica labels pass thru now) it'd " [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [13:59:17] 06SRE-OnFire, 06Discovery-Search, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9753296 (10bking) [13:59:28] (03CR) 10Alexandros Kosiaris: [C:03+1] etcd::v3: Allow prometheus nodes to scrape etcd [puppet] - 10https://gerrit.wikimedia.org/r/1025362 (owner: 10JMeybohm) [13:59:58] (03CR) 10JMeybohm: [V:03+1] kubernetes::master: Add stacked control plane option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [14:00:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P61411 and previous config saved to /var/cache/conftool/dbconfig/20240429-140015-marostegui.json [14:01:07] (03CR) 10JMeybohm: [V:03+1 C:03+2] etcd::v3: Allow prometheus nodes to scrape etcd [puppet] - 10https://gerrit.wikimedia.org/r/1025362 (owner: 10JMeybohm) [14:01:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:02:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.79% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:02:19] (03CR) 10Ayounsi: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:03:41] 06SRE-OnFire, 06Discovery-Search, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9753318 (10bking) [14:03:53] (ProbeDown) firing: (48) Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:33] (03CR) 10Ssingh: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:04:42] (03PS7) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [14:05:26] (ProbeDown) firing: (46) Service restbase2028-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:51] 06SRE-OnFire, 06Discovery-Search, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9753324 (10bking) [14:06:29] RECOVERY - Check whether ferm is 
active by checking the default input chain on mw1478 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:07:01] 06SRE, 10Cumin, 06Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811#9753327 (10ssingh) Sorry, I forgot to reply to this! >>! In T355811#9717278, @Volans wrote: > I see only one ca... [14:07:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61412 and previous config saved to /var/cache/conftool/dbconfig/20240429-140748-arnaudb.json [14:08:53] (ProbeDown) firing: (42) Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:09:33] (03PS1) 10Marostegui: es2035: Make it es6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/1025370 (https://phabricator.wikimedia.org/T355424) [14:09:55] RECOVERY - Check whether ferm is active by checking the default input chain on parse1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:10:40] (03CR) 10Ayounsi: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:11:20] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [14:11:40] !log btullis@cumin1002 
START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [14:12:51] (03CR) 10Marostegui: [C:03+2] es2035: Make it es6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/1025370 (https://phabricator.wikimedia.org/T355424) (owner: 10Marostegui) [14:13:02] (03CR) 10Ssingh: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:13:53] (ProbeDown) firing: (41) Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:27] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [14:14:30] !log reedy@deploy1002 Synchronized php-1.43.0-wmf.2/extensions/TimedMediaHandler/: T363550 (duration: 14m 42s) [14:14:35] T363550: Can't play video thumbnail if the filename contains an apostrophe - https://phabricator.wikimedia.org/T363550 [14:15:09] PROBLEM - Disk space on prometheus2006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 49420 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops [14:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P61413 and previous config saved to /var/cache/conftool/dbconfig/20240429-141523-marostegui.json [14:15:26] (ProbeDown) firing: (39) Service restbase2029-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:56] elukey: ^ ? [14:16:29] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:17:04] jayme: yes yes all "good", part of the move is to move nagios alerts to prometheus blackbox, and there are alerts where puppet ran but the cookbook hasn't restarted the instances yet (so no new TLS cert etc.) [14:17:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [14:17:16] *hosts not alerts [14:17:36] all instances are up, monitoring it via nodetool [14:17:49] (thanks for the ping) [14:18:02] ah, okidoke [14:18:55] (03CR) 10Vgutierrez: [C:03+1] benthos:haproxy_cache: field renaming moved to grok pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) (owner: 10Fabfur) [14:21:28] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:22:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61414 and previous config saved to /var/cache/conftool/dbconfig/20240429-142254-arnaudb.json [14:23:53] (ProbeDown) firing: (34) Service restbase2030-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:18] (03CR) 10Ssingh: geo-maps: define initial mapping for South 
America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:24:35] (03CR) 10JMeybohm: [C:04-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [14:24:43] (03PS1) 10Jgiannelos: wikifeeds: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025377 [14:25:26] (ProbeDown) firing: (31) Service restbase2030-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:34] 06SRE-OnFire, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 03Discovery-Search (Current work), 10Sustainability (Incident Followup): Post incident tasks: Search missing results/unavailable for some eqiad users - https://phabricator.wikimedia.org/T363694#9753421 (10Gehel) [14:27:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 37.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:14] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025377 (owner: 10Jgiannelos) [14:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 36.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[14:30:18] (03Merged) 10jenkins-bot: wikifeeds: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025377 (owner: 10Jgiannelos) [14:30:26] (ProbeDown) firing: (30) Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T361627)', diff saved to https://phabricator.wikimedia.org/P61415 and previous config saved to /var/cache/conftool/dbconfig/20240429-143030-marostegui.json [14:30:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance [14:30:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:30:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance [14:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T361627)', diff saved to https://phabricator.wikimedia.org/P61416 and previous config saved to /var/cache/conftool/dbconfig/20240429-143053-marostegui.json [14:30:59] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9753457 (10Volans) p:05Triage→03Medium [14:32:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bullseye [14:33:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 50 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:33:35] 06SRE, 06Infrastructure-Foundations: puppetserver1001.eqiad.wmnet is unresponsive - https://phabricator.wikimedia.org/T363615#9753470 (10Volans) p:05Triage→03Medium [14:33:53] (ProbeDown) firing: (25) Service restbase2031-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T361627)', diff saved to https://phabricator.wikimedia.org/P61417 and previous config saved to /var/cache/conftool/dbconfig/20240429-143444-marostegui.json [14:35:22] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:36:25] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:36:29] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:37:01] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:38:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61418 and previous config saved to /var/cache/conftool/dbconfig/20240429-143800-arnaudb.json [14:38:18] !log add 120G to prometheus/k8s in codfw [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:53] (ProbeDown) firing: (24) Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:53] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:24] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:40:18] RECOVERY - Check whether ferm is active by checking the default input chain on parse1024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:40:26] (ProbeDown) firing: (23) Service restbase2032-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:18] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bookworm [14:43:53] (ProbeDown) firing: (18) Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2175/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (owner: 10EoghanGaffney) [14:44:42] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (owner: 10EoghanGaffney) [14:48:00] RECOVERY - Disk space on prometheus2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2005&var-datasource=codfw+prometheus/ops [14:48:53] (ProbeDown) firing: (17) Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P61420 and previous config saved to /var/cache/conftool/dbconfig/20240429-144951-marostegui.json [14:50:26] (ProbeDown) resolved: (15) Service restbase2033-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:27] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1024758 (https://phabricator.wikimedia.org/T363713) [14:51:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T363713 [14:51:48] (03PS8) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [14:51:52] T363713: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T363713 [14:52:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2129 with weight 0 T363713', diff saved to https://phabricator.wikimedia.org/P61421 and previous config saved to /var/cache/conftool/dbconfig/20240429-145203-arnaudb.json [14:52:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T363713 [14:53:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61422 and previous config saved to /var/cache/conftool/dbconfig/20240429-145306-arnaudb.json [14:54:05] !log elukey@cumin1002 END (PASS) - 
Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[23-35]*: Roll out PKI TLS certs - elukey@cumin1002 [14:55:10] RECOVERY - Disk space on prometheus2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops [14:58:53] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:39] !log robh@cumin1002 START - Cookbook sre.dns.netbox [14:59:47] (03PS1) 10Andrew Bogott: Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/1025385 [15:01:34] !log robh@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7002 setup - robh@cumin1002" [15:02:25] !log robh@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7002 setup - robh@cumin1002" [15:02:25] !log robh@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:26] (03PS3) 10Elukey: role::restbase::production: move eqiad Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) [15:03:31] !log robh@cumin1002 START - Cookbook sre.hosts.provision for host cp7002.mgmt.magru.wmnet with reboot policy FORCED [15:03:43] (03PS9) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [15:05:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P61423 and 
previous config saved to /var/cache/conftool/dbconfig/20240429-150459-marostegui.json [15:07:14] (03CR) 10Fabfur: [C:03+2] benthos:haproxy_cache: field renaming moved to grok pattern [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) (owner: 10Fabfur) [15:08:07] (03CR) 10Fabfur: [C:03+2] benthos:haproxy_cache: field renaming moved to grok pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) (owner: 10Fabfur) [15:08:44] (03CR) 10Ssingh: "BHZ" [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:09:11] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/1025385 (owner: 10Andrew Bogott) [15:09:28] (03CR) 10Ssingh: "Sorry, incomplete comment but I wanted to mention the city of BHZ here." [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:09:56] (03CR) 10Elukey: [C:03+2] role::restbase::production: move eqiad Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:10:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:12:26] (03PS2) 10Scott French: kubernetes: add usernames for commons-impact-analytics to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1023959 (https://phabricator.wikimedia.org/T361835) [15:12:26] (03PS2) 10Scott French: DNM: cassandra: add commons_impact_analytics user [puppet] - 10https://gerrit.wikimedia.org/r/1023960 
(https://phabricator.wikimedia.org/T361835) [15:12:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9753628 (10andrea.denisse) [15:14:36] !log robh@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7002.mgmt.magru.wmnet with reboot policy FORCED [15:15:55] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1028.eqiad.wmnet: Move to PKI TLS certs - elukey@cumin1002 [15:15:55] (03PS1) 10Andrew Bogott: Horizon local settings: update for django 4 [puppet] - 10https://gerrit.wikimedia.org/r/1025389 [15:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:16:32] (03CR) 10Andrew Bogott: [C:03+2] Horizon local settings: update for django 4 [puppet] - 10https://gerrit.wikimedia.org/r/1025389 (owner: 10Andrew Bogott) [15:18:01] (03CR) 10Scott French: [C:03+2] kubernetes: add usernames for commons-impact-analytics to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1023959 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:18:20] (03PS1) 10Hnowlan: Include mw-jobrunner port in host header check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 [15:18:28] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:19:19] !log robh@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7002'] [15:20:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 
(T361627)', diff saved to https://phabricator.wikimedia.org/P61424 and previous config saved to /var/cache/conftool/dbconfig/20240429-152006-marostegui.json [15:20:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:20:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:20:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:20:26] (ProbeDown) firing: Service restbase1028-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1028-c:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T361627)', diff saved to https://phabricator.wikimedia.org/P61425 and previous config saved to /var/cache/conftool/dbconfig/20240429-152029-marostegui.json [15:20:50] (03CR) 10Effie Mouzeli: [C:03+1] Include mw-jobrunner port in host header check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025391 (owner: 10Hnowlan) [15:22:40] (03PS9) 10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [15:23:05] !log robh@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp7002'] [15:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T361627)', diff saved to https://phabricator.wikimedia.org/P61426 and previous config saved to /var/cache/conftool/dbconfig/20240429-152314-marostegui.json [15:23:50] 
(03PS2) 10Scott French: admin_ng: add namespace for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) [15:23:51] (03PS2) 10Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [15:23:51] (03PS2) 10Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [15:23:53] (ProbeDown) resolved: (2) Service restbase1028-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:24:39] (03CR) 10CI reject: [V:04-1] DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:24:44] (03CR) 10CI reject: [V:04-1] DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:25:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1028.eqiad.wmnet: Move to PKI TLS certs - elukey@cumin1002 [15:25:54] PROBLEM - cassandra-c SSL 10.64.0.211:7000 on restbase1028 is CRITICAL: SSL CRITICAL - 
failed to verify restbase1028-c against restbase1028-c.eqiad.wmnet, cassandra, restbase1028.eqiad.wmnet:Certificate restbase1028-c.eqiad.wmnet valid until 2024-05-27 15:09:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:25:54] PROBLEM - cassandra-a SSL 10.64.0.209:7000 on restbase1028 is CRITICAL: SSL CRITICAL - failed to verify restbase1028-a against restbase1028-a.eqiad.wmnet, cassandra, restbase1028.eqiad.wmnet:Certificate restbase1028-a.eqiad.wmnet valid until 2024-05-27 15:09:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:25:56] PROBLEM - cassandra-b SSL 10.64.0.210:7000 on restbase1028 is CRITICAL: SSL CRITICAL - failed to verify restbase1028-b against restbase1028-b.eqiad.wmnet, cassandra, restbase1028.eqiad.wmnet:Certificate restbase1028-b.eqiad.wmnet valid until 2024-05-27 15:09:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:26:33] these are the old nagios alerts --^ [15:26:48] (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::master: Add stacked control plane option [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:27:11] (03CR) 10Scott French: [C:03+2] admin_ng: add namespace for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:28:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2151', diff saved to https://phabricator.wikimedia.org/P61427 and previous config saved to /var/cache/conftool/dbconfig/20240429-152809-arnaudb.json [15:29:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.04% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1530). [15:30:12] (03Merged) 10jenkins-bot: admin_ng: add namespace for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [15:30:25] o/ starting portals deploy [15:30:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:31:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:32:04] !log swfrench@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:32:14] (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::master: Add stacked control plane option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025278 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:34:27] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025396 (https://phabricator.wikimedia.org/T128546) [15:34:40] !log swfrench@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:34:50] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025396 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:35:00] !log root@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[29-42]*: Move Cassandra to PKI - root@cumin1002 [15:35:17] !log swfrench@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:35:45] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025396 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 39.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:36:37] !log swfrench@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:37:04] (03PS1) 10JMeybohm: Add kubestage2003 to staging-codfw and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1025397 (https://phabricator.wikimedia.org/T363307) [15:37:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [15:37:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:32] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[15:37:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [15:37:37] (PS1) Hnowlan: mw-parsoid: bump workers [deployment-charts] - https://gerrit.wikimedia.org/r/1025398 [15:38:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P61428 and previous config saved to /var/cache/conftool/dbconfig/20240429-153821-marostegui.json [15:38:40] (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:39:28] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9753859 (YLiou_WMF) >>! In T363514#9746032, @Isaac wrote: > @YLiou_WMF here's the task -- please sign L3 > > @Miriam I put this together so Yu-Ming has... [15:39:48] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:40:15] (CR) JMeybohm: [C:+1] mw-parsoid: bump workers [deployment-charts] - https://gerrit.wikimedia.org/r/1025398 (owner: Hnowlan) [15:40:41] (CR) Hnowlan: [C:+2] mw-parsoid: bump workers [deployment-charts] - https://gerrit.wikimedia.org/r/1025398 (owner: Hnowlan) [15:41:32] (Merged) jenkins-bot: mw-parsoid: bump workers [deployment-charts] - https://gerrit.wikimedia.org/r/1025398 (owner: Hnowlan) [15:43:11] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:44:13] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:48:53] (ProbeDown) firing: (2) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:26] (ProbeDown) firing: (19) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:21] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2003 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1025399 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:52:40] (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:53:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P61429 and previous config saved to /var/cache/conftool/dbconfig/20240429-155328-marostegui.json [15:53:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9753921 (10Dzahn) [15:53:40] (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:53:53] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2129 to s6 master [puppet] - 
10https://gerrit.wikimedia.org/r/1024758 (https://phabricator.wikimedia.org/T363713) (owner: 10Gerrit maintenance bot) [15:53:53] (ProbeDown) firing: (74) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:13] !log Starting s6 codfw failover from db2114 to db2129 - T363713 [15:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:18] T363713: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T363713 [15:55:26] (ProbeDown) firing: (74) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T363713', diff saved to https://phabricator.wikimedia.org/P61430 and previous config saved to /var/cache/conftool/dbconfig/20240429-155557-arnaudb.json [15:56:35] (03CR) 10JMeybohm: [C:03+2] Add kubestage2003 to staging-codfw and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1025397 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:56:39] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9753930 (10RobH) [15:56:45] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1025396| Bumping portals to master (T128546)]] (duration: 14m 38s) [15:56:50] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:57:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:40] (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestagemaster2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:58:39] (03PS1) 10Ssingh: site.pp: set role insetup for cp7002 [puppet] - 10https://gerrit.wikimedia.org/r/1025403 (https://phabricator.wikimedia.org/T362729) [15:58:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2114 T363713', diff saved to https://phabricator.wikimedia.org/P61431 and previous config saved to /var/cache/conftool/dbconfig/20240429-155838-arnaudb.json [15:58:53] (ProbeDown) firing: (74) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:17] (03CR) 10Ssingh: [C:03+2] site.pp: set role insetup for cp7002 [puppet] - 10https://gerrit.wikimedia.org/r/1025403 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [15:59:45] jayme: ok to merge your change? [15:59:52] (03PS1) 10JMeybohm: Fix copy/paste error for kubestagemaster2003 [puppet] - 10https://gerrit.wikimedia.org/r/1025404 (https://phabricator.wikimedia.org/T363307) [15:59:54] JMeybohm: Add kubestage2003 to staging-codfw and conftool (f089174759) [16:00:00] sukhe: nope [16:00:09] sukhe: please cancel [16:00:10] ok stopping [16:00:25] feel free to merge mine when done thanks [16:00:28] this is actually the first time I'm seeing someone saying no :D [16:00:32] ha! 
[16:00:38] which means we should keep asking this question :P [16:00:41] will merge yours in a minute [16:00:54] (CR) Dzahn: [C:+1] "I still see it in private repo, fwiw. But deleting this can at worse break something in cloud VPS, so easy +1 regardless" [labs/private] - https://gerrit.wikimedia.org/r/1024824 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [16:00:54] no rush [16:00:54] we have a roadblock on puppet merge? [16:01:06] robh: yep [16:01:31] (CR) Ssingh: [C:-2] "Don't merge in general + needs fixes." [dns] - https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh) [16:01:37] (CR) JMeybohm: [C:+2] Fix copy/paste error for kubestagemaster2003 [puppet] - https://gerrit.wikimedia.org/r/1025404 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm) [16:02:05] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9753974 (elukey) After a lot of tests and config changes, we are almost ready to proceed with prod. Hopefully we'll get to it on April 2nd.
[16:02:17] sukhe/robh: unblocked [16:02:23] thanks [16:03:00] (PS1) Andrew Bogott: Update codfw1dev horizon support files to 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1025405 [16:03:24] (CR) Andrew Bogott: [C:+2] Update codfw1dev horizon support files to 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1025405 (owner: Andrew Bogott) [16:04:33] (KubernetesCalicoDown) firing: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:05:26] (ProbeDown) firing: (74) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:47] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [16:06:59] ops-magru, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9753982 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye [16:08:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T361627)', diff saved to https://phabricator.wikimedia.org/P61432 and previous config saved to /var/cache/conftool/dbconfig/20240429-160836-marostegui.json [16:08:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [16:08:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[16:08:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [16:08:53] (ProbeDown) firing: (74) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T361627)', diff saved to https://phabricator.wikimedia.org/P61433 and previous config saved to /var/cache/conftool/dbconfig/20240429-160859-marostegui.json [16:09:14] kubestagemaster2003 alerts is me [16:10:55] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1025396| Bumping portals to master (T128546)]] (duration: 14m 10s) [16:11:05] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:11:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T361627)', diff saved to https://phabricator.wikimedia.org/P61434 and previous config saved to /var/cache/conftool/dbconfig/20240429-161143-marostegui.json [16:15:26] (ProbeDown) firing: (71) Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:50] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7002.magru.wmnet with OS bullseye [16:18:53] 
(ProbeDown) firing: (66) Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:57] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye executed with errors: - cp7002 (**FA... [16:19:05] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7002'] [16:19:25] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp7002'] [16:19:33] (KubernetesCalicoDown) resolved: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:19:49] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7002'] [16:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:28] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:20:30] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts 
['cp7002'] [16:20:45] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7002'] [16:21:46] (03PS13) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [16:23:30] !log jayme@cumin1002 conftool action : set/weight=10; selector: name=kubestagemaster2003.codfw.wmnet [16:23:39] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2003.codfw.wmnet [16:23:53] (ProbeDown) firing: (66) Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:26] (ProbeDown) firing: (63) Service restbase1032-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:25] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7002'] [16:26:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P61435 and previous config saved to /var/cache/conftool/dbconfig/20240429-162650-marostegui.json [16:27:44] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:28:53] (ProbeDown) firing: (60) Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:59] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754123 (10RobH) [16:29:43] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack b3 cp hosts - robh@cumin2002" [16:29:56] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [16:30:02] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye [16:30:38] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack b3 cp hosts - robh@cumin2002" [16:30:38] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:59] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7003.mgmt.magru.wmnet with reboot policy FORCED [16:33:53] (ProbeDown) firing: (58) Service restbase1033-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:35] (03CR) 10Jgiannelos: [C:03+1] Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 (owner: 10C. 
Scott Ananian) [16:34:53] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754177 (10RobH) [16:35:26] (ProbeDown) firing: (56) Service restbase1033-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:36] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7005.mgmt.magru.wmnet with reboot policy FORCED [16:37:57] 10ops-magru, 13Patch-For-Review: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9754193 (10BCornwall) [16:40:26] (ProbeDown) firing: (54) Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:40] (03PS5) 10Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) [16:41:17] (03CR) 10Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [16:41:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P61436 and previous config saved to /var/cache/conftool/dbconfig/20240429-164158-marostegui.json [16:43:53] (ProbeDown) firing: (50) Service restbase1034-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:40] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7003.mgmt.magru.wmnet with reboot policy FORCED [16:45:55] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7007.mgmt.magru.wmnet with reboot policy FORCED [16:46:37] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754260 (10RobH) [16:49:13] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7005.mgmt.magru.wmnet with reboot policy FORCED [16:50:17] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7009.mgmt.magru.wmnet with reboot policy FORCED [16:50:26] (ProbeDown) firing: (48) Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:36] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754274 (10RobH) [16:51:40] (03PS1) 10Ebernhardson: cirrus updater: Reduce saneitizer capacity and fix backfill arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025410 [16:51:40] (03PS1) 10Ebernhardson: cirrus updater: Enable codfw consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025411 (https://phabricator.wikimedia.org/T363475) [16:53:53] (ProbeDown) firing: (43) Service restbase1035-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:17] (03CR) 10Dzahn: [C:03+2] miscweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024640 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:57:06] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T361627)', diff saved to https://phabricator.wikimedia.org/P61437 and previous config saved to /var/cache/conftool/dbconfig/20240429-165705-marostegui.json [16:57:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:57:11] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:57:12] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7007.mgmt.magru.wmnet with reboot policy FORCED [16:57:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:57:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61438 and previous config saved to /var/cache/conftool/dbconfig/20240429-165728-marostegui.json [16:58:53] (ProbeDown) firing: (42) Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:53] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7011.mgmt.magru.wmnet with reboot policy FORCED [16:59:14] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bullseye [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T1700). 
[17:00:26] (ProbeDown) firing: (40) Service restbase1036-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:01:25] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754341 (10RobH) [17:02:14] (03CR) 10Ebernhardson: [C:03+2] cirrus updater: Enable codfw consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025411 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [17:02:25] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7009.mgmt.magru.wmnet with reboot policy FORCED [17:03:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61439 and previous config saved to /var/cache/conftool/dbconfig/20240429-170311-marostegui.json [17:03:14] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7013.mgmt.magru.wmnet with reboot policy FORCED [17:03:18] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:03:50] (03PS1) 10Ayounsi: magru: update novacore v6 IP [homer/public] - 10https://gerrit.wikimedia.org/r/1025414 (https://phabricator.wikimedia.org/T362421) [17:03:53] (ProbeDown) firing: (36) Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:04:30] (03CR) 10Cathal Mooney: [C:03+1] magru: update novacore v6 IP [homer/public] - 10https://gerrit.wikimedia.org/r/1025414 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [17:04:54] (03CR) 
10Ayounsi: [C:03+2] magru: update novacore v6 IP [homer/public] - 10https://gerrit.wikimedia.org/r/1025414 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [17:06:07] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7015.mgmt.magru.wmnet with reboot policy FORCED [17:06:07] (03Merged) 10jenkins-bot: magru: update novacore v6 IP [homer/public] - 10https://gerrit.wikimedia.org/r/1025414 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [17:06:35] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754367 (10RobH) [17:07:03] (03PS1) 10Ssingh: hiera: update ntp_peers list for magru [puppet] - 10https://gerrit.wikimedia.org/r/1025415 (https://phabricator.wikimedia.org/T346722) [17:07:37] (03CR) 10Ssingh: [C:03+2] hiera: update ntp_peers list for magru [puppet] - 10https://gerrit.wikimedia.org/r/1025415 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:08:53] (ProbeDown) firing: (34) Service restbase1037-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:13] (03PS2) 10Scott French: wmnet: add CNAME records for commons-impact-analytics (k8s ingress) [dns] - 10https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) [17:10:26] (ProbeDown) firing: (33) Service restbase1037-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:26] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7011.mgmt.magru.wmnet with reboot policy FORCED [17:12:37] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9754389 (10Jhancock.wm) 
Apologies for the wait on this one. I checked out the server and the drives look to be working physically. But when I logged into the idrac it sees zero disks. Checked the warranty an... [17:12:47] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:13:47] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:13:53] (ProbeDown) firing: (30) Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:54] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7002.magru.wmnet with OS bullseye [17:13:58] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:13:58] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye executed with errors: - cp7002 (**FAIL**) - Removed from... [17:14:06] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:14:51] (03PS1) 10Ssingh: hiera: add magru to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/1025419 (https://phabricator.wikimedia.org/T346722) [17:15:29] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7013.mgmt.magru.wmnet with reboot policy FORCED [17:17:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1025419 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:17:13] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [17:17:19] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7015.mgmt.magru.wmnet with reboot policy FORCED [17:18:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P61440 and previous config saved to /var/cache/conftool/dbconfig/20240429-171818-marostegui.json [17:18:21] (03CR) 10Ssingh: [C:03+2] hiera: add magru to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/1025419 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:18:37] (03PS1) 10JMeybohm: etcd: Notify etcd on PKI cert generation and reneval [puppet] - 10https://gerrit.wikimedia.org/r/1025422 (https://phabricator.wikimedia.org/T363307) [17:18:53] (ProbeDown) firing: (28) Service restbase1038-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:46] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:20:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [17:20:14] (03CR) 10Ebernhardson: [C:03+2] cirrus updater: Reduce saneitizer capacity and fix backfill arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025410 (owner: 10Ebernhardson) [17:20:26] (ProbeDown) firing: (26) Service restbase1038-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:26] (03CR) 10JMeybohm: "@alex: do you 
know by chance?" [puppet] - 10https://gerrit.wikimedia.org/r/1025422 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [17:20:48] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754449 (10RobH) [17:21:12] (03Merged) 10jenkins-bot: cirrus updater: Reduce saneitizer capacity and fix backfill arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025410 (owner: 10Ebernhardson) [17:21:14] (03Merged) 10jenkins-bot: cirrus updater: Enable codfw consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025411 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [17:22:02] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack b4 cp hosts - robh@cumin2002" [17:22:55] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rack b4 cp hosts - robh@cumin2002" [17:22:55] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:24:24] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7004.mgmt.magru.wmnet with reboot policy FORCED [17:24:26] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7006.mgmt.magru.wmnet with reboot policy FORCED [17:24:28] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7008.mgmt.magru.wmnet with reboot policy FORCED [17:24:29] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7010.mgmt.magru.wmnet with reboot policy FORCED [17:25:03] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7012.mgmt.magru.wmnet with reboot policy FORCED [17:25:26] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp7006.mgmt.magru.wmnet with reboot policy FORCED [17:25:26] (ProbeDown) firing: (24) Service restbase1039-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:41] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7014.mgmt.magru.wmnet with reboot policy FORCED [17:26:57] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7016.mgmt.magru.wmnet with reboot policy FORCED [17:28:44] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:28:51] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:28:53] (ProbeDown) firing: (20) Service restbase1039-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:03] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7006.mgmt.magru.wmnet with reboot policy FORCED [17:31:44] (CirrusStreamingUpdaterFlinkJobUnstable) firing: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:33:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P61441 and previous config saved to /var/cache/conftool/dbconfig/20240429-173326-marostegui.json [17:34:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 203034464 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:35:26] 
(ProbeDown) firing: (18) Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:29] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 29368 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:36:19] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7010.mgmt.magru.wmnet with reboot policy FORCED [17:36:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bullseye [17:36:44] (CirrusStreamingUpdaterFlinkJobUnstable) resolved: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:36:53] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7004.mgmt.magru.wmnet with reboot policy FORCED [17:37:00] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7008.mgmt.magru.wmnet with reboot policy FORCED [17:37:07] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7012.mgmt.magru.wmnet with reboot policy FORCED [17:38:03] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7016.mgmt.magru.wmnet with reboot policy FORCED [17:38:20] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7014.mgmt.magru.wmnet with reboot policy FORCED [17:38:32] (03PS14) 10CDobbins: purged: add PKI cert handling [puppet] 
- 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [17:38:53] (ProbeDown) resolved: (12) Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:38] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7006.mgmt.magru.wmnet with reboot policy FORCED [17:41:25] (03PS1) 10Btullis: Allow the ceph-common package to create the ceph user/group [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) [17:41:27] !log root@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[29-42]*: Move Cassandra to PKI - root@cumin1002 [17:42:15] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7003'] [17:42:21] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7004'] [17:42:26] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7005'] [17:42:31] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7006'] [17:42:35] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7007'] [17:42:40] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7008'] [17:42:50] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7009'] [17:43:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2176/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [17:45:35] (03Abandoned) 
10Ebernhardson: Provide a specific user agent when checking servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/743222 (owner: 10Ebernhardson) [17:48:01] (03PS1) 10Fabfur: site: adding cp7003.magru.wmnet for test insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1025430 (https://phabricator.wikimedia.org/T362729) [17:48:02] (03PS1) 10Andrew Bogott: horizon local_settings.py: forward some changes that were made on zed [puppet] - 10https://gerrit.wikimedia.org/r/1025431 [17:48:26] (03CR) 10Ssingh: [C:03+1] site: adding cp7003.magru.wmnet for test insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1025430 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [17:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61442 and previous config saved to /var/cache/conftool/dbconfig/20240429-174834-marostegui.json [17:48:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:48:40] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:48:41] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7003'] [17:48:49] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7006'] [17:48:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:48:55] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7008'] [17:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T361627)', diff saved to https://phabricator.wikimedia.org/P61443 and previous 
config saved to /var/cache/conftool/dbconfig/20240429-174856-marostegui.json [17:49:02] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7004'] [17:49:15] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7005'] [17:49:24] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7007'] [17:49:28] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7009'] [17:50:24] (03CR) 10Fabfur: [C:03+2] site: adding cp7003.magru.wmnet for test insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1025430 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [17:51:45] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T363661#9754620 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033 [17:53:29] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7010'] [17:53:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T361627)', diff saved to https://phabricator.wikimedia.org/P61444 and previous config saved to /var/cache/conftool/dbconfig/20240429-175340-marostegui.json [17:53:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:53:46] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7011'] [17:53:49] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS bullseye [17:53:58] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754648 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye [17:54:05] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7012'] [17:54:21] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7013'] [17:54:23] 10ops-eqiad, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T363566#9754652 (10Jclark-ctr) a:03Jclark-ctr [17:55:26] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7014'] [17:55:39] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9754660 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [17:56:08] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7015'] [17:56:21] (03CR) 10Btullis: [V:03+1] "The current discrepancy can be seen here:" [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [17:56:23] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7016'] [17:57:16] (03PS15) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:58:02] (03PS16) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [17:59:06] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7010'] [17:59:33] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754700 (10RobH) [17:59:47] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7011'] [18:00:56] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware 
for hosts ['cp7012'] [18:01:14] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7013'] [18:01:42] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7015'] [18:01:50] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7014'] [18:01:58] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:02:30] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7016'] [18:03:39] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9754742 (10andrea.denisse) It seems like `thanos-swift` is not using CFSSL. 
[18:05:37] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:08:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P61445 and previous config saved to /var/cache/conftool/dbconfig/20240429-180848-marostegui.json [18:09:59] (PS1) Ssingh: magru: add hierdata/magru.yaml [puppet] - https://gerrit.wikimedia.org/r/1025436 (https://phabricator.wikimedia.org/T346722) [18:10:16] (PS1) Andrew Bogott: codfw1dev: new Horizon version [puppet] - https://gerrit.wikimedia.org/r/1025437 [18:10:55] (CR) Andrew Bogott: [C:+2] codfw1dev: new Horizon version [puppet] - https://gerrit.wikimedia.org/r/1025437 (owner: Andrew Bogott) [18:11:01] (CR) Andrew Bogott: [C:+2] horizon local_settings.py: forward some changes that were made on zed [puppet] - https://gerrit.wikimedia.org/r/1025431 (owner: Andrew Bogott) [18:11:14] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!"
[puppet] - 10https://gerrit.wikimedia.org/r/1025436 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [18:11:45] (03CR) 10Fabfur: [C:03+1] "lgmt" [puppet] - 10https://gerrit.wikimedia.org/r/1025436 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [18:11:51] (03CR) 10Ssingh: [C:03+2] magru: add hierdata/magru.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1025436 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [18:16:45] (03PS2) 10Ebernhardson: Shift writes to SUP, 1st batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024411 (https://phabricator.wikimedia.org/T363475) (owner: 10Peter Fischer) [18:22:29] (03PS1) 10Ssingh: hiera: add magru prometheus node to prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1025441 (https://phabricator.wikimedia.org/T346722) [18:22:45] (03CR) 10Andrea Denisse: [V:03+2] ssl: Remove unnecessary dummy key from thanos-query hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024824 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:22:47] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Remove unnecessary dummy key from thanos-query hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024824 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:22:51] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9754821 (10cmooney) 05Open→03Resolved Looks like this was a brief blip of inbound errors (unlike last time when they began and kept increasing until eventually the link failed). {F49321128} As such I'm gonna clos... 
[18:23:10] ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9754832 (Jclark-ctr) Resolved→Open [18:23:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P61446 and previous config saved to /var/cache/conftool/dbconfig/20240429-182355-marostegui.json [18:24:20] ops-eqiad, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T363566#9754835 (Jclark-ctr) Open→Resolved T363086 duplicate [18:25:32] !log Manually delete unused TLS certificates for thanos-query as part of the CFSSL migration - T360414 [18:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:38] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [18:26:37] ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9754850 (Jclark-ctr) @Clement_Goubert @akosiaris since this failed again i did reset idrac again and is back up right now. Idrac is not showing anything and is out of warranty. with my limited access... [18:27:24] (CR) Fabfur: [C:+1] "let's try this" [puppet] - https://gerrit.wikimedia.org/r/1025441 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh) [18:27:46] (CR) Ssingh: [C:+2] hiera: add magru prometheus node to prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1025441 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh) [18:28:10] (CR) Majavah: "does the host not need to exist in DNS before merging this?" [puppet] - https://gerrit.wikimedia.org/r/1025441 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh) [18:29:42] (CR) Ssingh: [C:+2] "The site is not up and running and we are trying to see if this resolves a failing Puppet run while reimaging.
So if it does, we will see " [puppet] - 10https://gerrit.wikimedia.org/r/1025441 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [18:30:30] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2120.codfw.wmnet - https://phabricator.wikimedia.org/T362787#9754855 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:30:43] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9754875 (10andrea.denisse) [18:31:02] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2119.codfw.wmnet - https://phabricator.wikimedia.org/T362790#9754860 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:31:06] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9754879 (10andrea.denisse) [18:31:24] (03PS1) 10Ayounsi: magru: update novacore IPv6 once more [homer/public] - 10https://gerrit.wikimedia.org/r/1025442 (https://phabricator.wikimedia.org/T362421) [18:32:05] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2113.codfw.wmnet - https://phabricator.wikimedia.org/T362792#9754881 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:32:14] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2112.codfw.wmnet - https://phabricator.wikimedia.org/T362793#9754894 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:32:53] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2111.codfw.wmnet - https://phabricator.wikimedia.org/T362794#9754901 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:33:31] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2110.codfw.wmnet - https://phabricator.wikimedia.org/T362795#9754905 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:34:03] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2109.codfw.wmnet - 
https://phabricator.wikimedia.org/T362796#9754911 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:34:37] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2108.codfw.wmnet - https://phabricator.wikimedia.org/T362797#9754914 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:35:05] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2107.codfw.wmnet - https://phabricator.wikimedia.org/T362798#9754918 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:35:34] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2105.codfw.wmnet - https://phabricator.wikimedia.org/T362800#9754932 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:35:47] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2106.codfw.wmnet - https://phabricator.wikimedia.org/T362799#9754927 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:36:17] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2103.codfw.wmnet - https://phabricator.wikimedia.org/T362801#9754936 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:39:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T361627)', diff saved to https://phabricator.wikimedia.org/P61447 and previous config saved to /var/cache/conftool/dbconfig/20240429-183903-marostegui.json [18:39:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:39:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:39:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:41:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [18:41:41] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [18:43:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:43:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:44:57] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9754952 (10Eevans) Ok, so to summarize what has happened so far: We set about to replace what was `sdg`, but the wrong device was pulled by accident (`sdf` was pulled). When //that// drive was reinstalled... [18:45:34] (03CR) 10Ayounsi: [C:03+2] magru: update novacore IPv6 once more [homer/public] - 10https://gerrit.wikimedia.org/r/1025442 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [18:46:10] (03Merged) 10jenkins-bot: magru: update novacore IPv6 once more [homer/public] - 10https://gerrit.wikimedia.org/r/1025442 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [18:49:06] (03CR) 10Scott French: [C:03+2] wmnet: add CNAME records for commons-impact-analytics (k8s ingress) [dns] - 10https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [18:50:42] !log running authdns-update on dns1004 for T361835 [18:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:47] T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835 [18:51:41] (03PS2) 10Andrea Denisse: trafficserver: Add discovery entries for grafana and grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) [18:53:21] (03PS3) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T356386) 
[18:53:24] (03PS1) 10Ssingh: hiera: insetup::traffic: set prometheus_nodes to [] [puppet] - 10https://gerrit.wikimedia.org/r/1025443 [18:53:58] (03CR) 10Ssingh: [C:03+2] hiera: insetup::traffic: set prometheus_nodes to [] [puppet] - 10https://gerrit.wikimedia.org/r/1025443 (owner: 10Ssingh) [18:54:51] (03PS4) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T356386) [18:58:27] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [18:58:37] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7002.magru.wmnet with OS bullseye [18:58:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:56] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:06:07] (03PS1) 10Andrea Denisse: trafficserver: Add discovery entries for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025445 (https://phabricator.wikimedia.org/T356386) [19:08:10] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7003.magru.wmnet with OS bullseye [19:08:18] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye executed with errors: - 
cp7003 (**FAIL**) - Removed fr... [19:11:09] !log mforns@deploy1002 Started deploy [analytics/refinery@1693892]: Fixes to queries for Commons Impact Metrics dumps [analytics/refinery@1693892a] [19:16:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 36.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:16:36] (03PS1) 10Andrea Denisse: wmnet: Add discovery entries for the Prometheus hosts [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) [19:17:38] (03CR) 10CI reject: [V:04-1] wmnet: Add discovery entries for the Prometheus hosts [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [19:21:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 36.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:25:07] !log mforns@deploy1002 Finished deploy [analytics/refinery@1693892]: Fixes to queries for Commons Impact Metrics dumps [analytics/refinery@1693892a] (duration: 13m 58s) [19:25:13] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9755087 (10Jclark-ctr) @Volans We have replaced this drive 4 times now and continues to fail we no 
longer suspect that it is a Drive issue and maybe a process issues for recreating mdadm raid 10. We are a... [19:25:57] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9755105 (10Eevans) Ok, `sdf` has been replaced //again//, here is a transcript of what was done to add it back to the array: `sh-session eevans@aqs1014:~$ sudo lshw -class disk *-disk:0... [19:26:12] !log mforns@deploy1002 Started deploy [analytics/refinery@1693892] (thin): Fixes queries for Commons Impact MEtrics dumps THIN [analytics/refinery@1693892a] [19:27:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [19:29:58] !log mforns@deploy1002 Finished deploy [analytics/refinery@1693892] (thin): Fixes queries for Commons Impact MEtrics dumps THIN [analytics/refinery@1693892a] (duration: 03m 46s) [19:30:13] !log mforns@deploy1002 Started deploy [analytics/refinery@1693892] (hadoop-test): Fixes queries for Commons Impact MEtrics dumps TEST [analytics/refinery@1693892a] [19:31:29] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [19:31:52] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS bullseye [19:31:59] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye [19:33:15] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755130 (10RobH) 05Open→03In progress a:03RobH Stealing this task to do the network provision, bios provision, dns setup, and firmware setu... 
[19:33:36] !log mforns@deploy1002 Finished deploy [analytics/refinery@1693892] (hadoop-test): Fixes queries for Commons Impact MEtrics dumps TEST [analytics/refinery@1693892a] (duration: 03m 22s) [19:41:17] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:45:24] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru misc hosts - robh@cumin2002" [19:46:10] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru misc hosts - robh@cumin2002" [19:46:10] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:47:39] (03PS1) 10Bking: search-platform: monitoring/alert on upstream MW API errors [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:51:23] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T2000). [20:00:04] esanders, cscott, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:26] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [20:02:06] hi - i can deploy [20:02:22] but are any of the patch holders around? 
[20:02:37] \o [20:02:58] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns7001.mgmt.magru.wmnet with reboot policy FORCED [20:03:11] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [20:03:14] hi ebernhardson - i guess i'll do yours first [20:03:37] cjming: awesome. Can't really test it much ahead of time, it's job-queue related [20:03:38] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns7002.mgmt.magru.wmnet with reboot policy FORCED [20:03:48] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7001.mgmt.magru.wmnet with reboot policy FORCED [20:03:59] ebernhardson: sounds good - i'll just sync when the time comes [20:04:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024411 (https://phabricator.wikimedia.org/T363475) (owner: 10Peter Fischer) [20:04:34] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7002.mgmt.magru.wmnet with reboot policy FORCED [20:04:46] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7003.mgmt.magru.wmnet with reboot policy FORCED [20:05:15] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7004.mgmt.magru.wmnet with reboot policy FORCED [20:05:23] (03Merged) 10jenkins-bot: Shift writes to SUP, 1st batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024411 (https://phabricator.wikimedia.org/T363475) (owner: 10Peter Fischer) [20:05:33] here [20:05:36] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs7001.mgmt.magru.wmnet with reboot policy FORCED [20:05:39] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1024411|Shift writes to SUP, 1st batch (T363475)]] [20:05:43] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:06:35] hi edsanders - i'll deploy 
your patch next - should be in a few mins [20:06:57] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs7003.mgmt.magru.wmnet with reboot policy FORCED [20:07:55] !log withdrawing prefixes from EdgeUno transit in magru to test paths via second transit [20:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:26] (03PS2) 10Scott French: service: add commons-impact-analytics AQS 2.0 service [puppet] - 10https://gerrit.wikimedia.org/r/1023961 (https://phabricator.wikimedia.org/T361835) [20:08:26] (03PS2) 10Scott French: DNM: service: move commons-impact-analytics service to production state [puppet] - 10https://gerrit.wikimedia.org/r/1023962 (https://phabricator.wikimedia.org/T361835) [20:09:03] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:09:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS bullseye [20:09:07] !log cjming@deploy1002 cjming and pfischer: Backport for [[gerrit:1024411|Shift writes to SUP, 1st batch (T363475)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:10] !log cjming@deploy1002 cjming and pfischer: Continuing with sync [20:09:12] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7002.magru.wmnet with OS bullseye completed: - cp7002 (**WARN**) - Downtimed on Icinga/Al... 
[20:11:17] (03CR) 10Scott French: [C:03+2] service: add commons-impact-analytics AQS 2.0 service [puppet] - 10https://gerrit.wikimedia.org/r/1023961 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [20:12:47] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T363756 (10phaultfinder) 03NEW [20:13:36] (03PS10) 10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [20:14:47] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns7002.mgmt.magru.wmnet with reboot policy FORCED [20:15:03] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns7001.mgmt.magru.wmnet with reboot policy FORCED [20:15:42] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7002.mgmt.magru.wmnet with reboot policy FORCED [20:16:13] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7003.mgmt.magru.wmnet with reboot policy FORCED [20:16:24] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7001.mgmt.magru.wmnet with reboot policy FORCED [20:17:12] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7004.mgmt.magru.wmnet with reboot policy FORCED [20:17:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:08] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 12), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9755282 (10Scott_French) I believe that's 
everything that can be done for now,... [20:19:00] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001'] [20:19:12] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001'] [20:19:15] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7002'] [20:19:51] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001'] [20:19:55] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001'] [20:20:22] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001'] [20:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:28] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001'] [20:21:57] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1024411|Shift writes to SUP, 1st batch (T363475)]] (duration: 16m 17s) [20:21:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs7003.mgmt.magru.wmnet with reboot policy FORCED [20:22:03] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs7001.mgmt.magru.wmnet with reboot policy FORCED [20:22:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: 10Esanders) [20:22:04] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti7001'] [20:22:09] T363475: 
SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:22:23] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti7002'] [20:22:55] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti7003'] [20:23:01] (03Merged) 10jenkins-bot: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: 10Esanders) [20:23:05] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [20:23:18] !log cjming@deploy1002 Started scap: Backport for [[gerrit:954920|Turn off DiscussionTools A/B test, and enable features on those wikis (T341491)]] [20:23:19] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti7004'] [20:23:24] T341491: [MILESTONE] Deploy config change to "turn off" Usability Improvements A/B test and enable features for A/B test wikis - https://phabricator.wikimedia.org/T341491 [20:23:42] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7001'] [20:23:54] (03PS5) 10Jdlrobson: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [20:23:57] (03CR) 10Jdlrobson: [C:03+1] Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [20:24:08] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7002'] [20:24:10] ebernhardson: should be live! 
[20:24:58] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs7002'] [20:25:53] cjming: yup, seeing it working. thanks! [20:25:59] !log cjming@deploy1002 cjming and esanders: Backport for [[gerrit:954920|Turn off DiscussionTools A/B test, and enable features on those wikis (T341491)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:12] glad to hear it! [20:26:14] edsanders: up on test servers if you want to check [20:26:14] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs7002.mgmt.magru.wmnet with reboot policy FORCED [20:26:26] cjming: looking [20:26:45] Yup, looks good - thanks [20:26:52] cool - syncing [20:26:56] !log cjming@deploy1002 cjming and esanders: Continuing with sync [20:27:29] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns7002'] [20:27:39] !log re-announcing magru prefixes to from EdgeUno [20:27:42] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti7001'] [20:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:15] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti7002'] [20:28:24] cscott: are you around? 
[20:28:31] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti7003'] [20:28:47] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001'] [20:28:57] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001'] [20:29:39] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs7001'] [20:29:45] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti7004'] [20:31:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1396 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:31:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1408 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:05] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:45] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:33:22] (ProbeDown) firing: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:35] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:34:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:35:27] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7003'] [20:37:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [20:37:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7003.magru.wmnet with OS bullseye [20:37:41] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye completed: - cp7003 (**WARN**) - Downtimed on Icinga/A... [20:38:22] (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:38:27] (03CR) 10Brouberol: "Once the extra whitespace is removed, this is good to go!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [20:39:24] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:954920|Turn off DiscussionTools A/B test, and enable features on those wikis (T341491)]] (duration: 16m 06s) [20:39:29] T341491: [MILESTONE] Deploy config change to "turn off" Usability Improvements A/B test and enable features for A/B test wikis - https://phabricator.wikimedia.org/T341491 [20:39:48] edsanders: should be live! [20:39:57] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:39:58] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755387 (10RobH) [20:40:34] cjming: yup - looks good [20:40:37] cjming: thanks! [20:40:46] yw! [20:41:04] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:41:14] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs7003'] [20:41:19] (03PS2) 10C. Scott Ananian: Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 [20:41:22] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755389 (10RobH) [20:42:02] cscott: i'll hold the window for a few more minutes in case you still want your changes to go out [20:54:52] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001'] [20:57:38] hey, deployers.  i'm way late for the backport window, sorry. 
[20:57:47] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@8c9c32c]: (no justification provided) [20:58:17] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@8c9c32c]: (no justification provided) (duration: 00m 30s) [20:58:41] cscott: do your patches need to go out today? [20:58:41] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7001.mgmt.magru.wmnet with reboot policy FORCED [20:58:43] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7002.mgmt.magru.wmnet with reboot policy FORCED [20:58:44] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7003.mgmt.magru.wmnet with reboot policy FORCED [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240429T2100). [21:00:08] cjming: they aren't time sensitive, i can just push them to tomorrow [21:00:21] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755454 (10RobH) [21:00:40] cjming: but i don't have backlog turned on for this channel so I thought i'd check in just in case deployers were still deploying [21:01:13] cscott: is it ok to push to tomorrow? 
i have a mtg now [21:01:21] works for me, no worries [21:01:34] i'll edit the calendar [21:01:39] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns7001'] [21:01:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1396 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:01:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1408 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:01:51] cool - thanks [21:02:00] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs7002.mgmt.magru.wmnet with reboot policy FORCED [21:02:00] my own fault for missing the window [21:02:05] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:02:38] (03CR) 10Cathal Mooney: "LGTM! 
IPs are now assigned in netbox:" [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [21:02:48] !log end of UTC late backport window [21:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:04:49] (03PS10) 10Bking: search-platform: monitoring/alert on upstream MW API errors [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [21:06:13] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7003.mgmt.magru.wmnet with reboot policy FORCED [21:06:14] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7002.mgmt.magru.wmnet with reboot policy FORCED [21:06:16] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7002'] [21:06:24] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs7002'] [21:06:32] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7002'] [21:06:40] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7001.mgmt.magru.wmnet with reboot policy FORCED [21:06:52] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7004.mgmt.magru.wmnet with reboot policy FORCED [21:14:20] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7004.mgmt.magru.wmnet with reboot policy FORCED [21:19:11] (03PS11) 10Bking: search-platform: monitoring/alert on upstream MW API errors [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [21:19:45] !log robh@cumin2002 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs7002'] [21:24:00] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9755541 (10Volans) @Jclark-ctr what do you mean by "process issues"? If `mdadm` shows the raid OK after the rebuilt I don't see problems there. Have we already tried to exclude other kind of problems? Such... [21:24:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:33:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:33:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51783 bytes in 0.611 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:36:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:42:13] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:03:23] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755661 (10RobH) 05In progress→03Open a:05RobH→03None All of the misc hosts have had network provisioning, firmware, and bios provisionin... 
[22:58:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:02:09] 06SRE, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#9755853 (10Reedy) [23:04:26] 06SRE, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#9755861 (10Reedy) [23:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:23:10] (03PS1) 10Aklapper: Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) [23:31:37] (03CR) 10Aklapper: "Note that I have no clue if this will automagically work and I am basically just copying code that looks similar." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [23:38:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024760 [23:38:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024760 (owner: 10TrainBranchBot) [23:53:06] (03PS1) 10Ssingh: Revert "hiera: insetup::traffic: set prometheus_nodes to []" [puppet] - 10https://gerrit.wikimedia.org/r/1025320 [23:53:46] (03PS1) 10Ssingh: site.pp: add cp7004 [puppet] - 10https://gerrit.wikimedia.org/r/1025482 (https://phabricator.wikimedia.org/T362729) [23:54:25] (03CR) 10Ssingh: [C:03+2] Revert "hiera: insetup::traffic: set prometheus_nodes to []" [puppet] - 10https://gerrit.wikimedia.org/r/1025320 (owner: 10Ssingh) [23:54:41] (03CR) 10Ssingh: [C:03+2] site.pp: add cp7004 [puppet] - 10https://gerrit.wikimedia.org/r/1025482 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [23:56:47] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS bullseye [23:56:57] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7004.magru.wmnet with OS bullseye [23:58:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024760 (owner: 10TrainBranchBot)