[00:04:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [00:10:37] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:10:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:11:46] !log removing one file for legal compliance [00:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [00:13:17] !log removing 2 files for legal compliance [00:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:15:13] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:35] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:15:37] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:20:11] !log removing one file for legal compliance [00:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:37] PROBLEM - WDQS SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.208 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:39:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930843 [00:39:33] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930843 (owner: 10TrainBranchBot) [00:58:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930843 (owner: 10TrainBranchBot) [01:01:37] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:01:49] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:07:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:07:47] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:13:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 
5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:18:21] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:18:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [01:18:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:27:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:29:29] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:32:23] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:33:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [01:33:57] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:44:41] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:44:51] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:46:15] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:46:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new 
developer account - https://phabricator.wikimedia.org/T337591 (10nshahquinn-wmf) I'm all set up with the new account with the exception of some Superset issues: T339385. Once those are sorted, we can lock th... [01:46:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:50:53] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:51:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:51:03] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:52:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:58:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:07] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:31] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:03:15] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:03:25] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:01] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:11:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:33] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:20:15] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:20:21] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:27:59] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:28:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:30] (Processor usage over 85%) firing: Alert for device 
cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:41] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:32:45] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:38:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:46:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:46:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [02:48:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:49:35] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:52:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:01:59] PROBLEM - OSPF status on 
cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:02:03] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:06:31] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:06:35] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:11:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:11:09] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:11:13] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:11:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [03:12:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:15:47] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:15:49] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:57] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:59] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:31:15] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:31:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:46:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [04:23:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:23:39] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:26:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:26:43] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:38:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [04:39:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) >>! In T326346#8950646, @RobH wrote: > Ok, I figured out the issue on the installations here. The first round of R450s mistakenly came with raid con... 
[04:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:53:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [05:03:21] (03PS1) 10Marostegui: dbproxy1022: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/931726 (https://phabricator.wikimedia.org/T337812) [05:04:34] (03CR) 10Marostegui: [C: 03+2] dbproxy1022: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/931726 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:07:27] (03PS1) 10Marostegui: site.pp: Remove dbproxy1022 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/931727 (https://phabricator.wikimedia.org/T337812) [05:15:55] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:18:59] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:23:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [05:28:15] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:28:19] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:29:47] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:29:51] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : 
OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:42:08] (03PS2) 10Dzahn: site: add buster people VMs to insetup role for decom [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) [05:48:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [05:48:36] 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) Thank you, @taavi :) [05:49:16] 10SRE, 10Gerrit, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) Thank you @ssingh :) [05:56:00] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove gitlab_default_can_create_group setting [puppet] - 10https://gerrit.wikimedia.org/r/931259 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T0600) [06:03:31] !log ariel@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1003.eqiad.wmnet with OS bullseye [06:04:01] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_bitu_username_block.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:42] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1003.eqiad.wmnet with reason: host reimage [06:07:51] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1003.eqiad.wmnet with 
reason: host reimage [06:08:22] 10ops-codfw: asw-d2-codfw:PEM0 down for 2 weeks - https://phabricator.wikimedia.org/T340002 (10ayounsi) p:05Triage→03High [06:22:06] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:23:18] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:24:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: add all miscweb domains as extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [06:25:49] (03PS3) 10Jelto: admin: add all miscweb domains as extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) [06:32:32] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:05] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove dbproxy1022 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/931727 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [06:33:27] (03CR) 10Jelto: [C: 03+2] admin: add all miscweb domains as extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [06:33:56] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:35:55] (03Merged) 10jenkins-bot: admin: add all miscweb domains as extra SANs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [06:38:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [06:40:28] !log hashar@deploy1002 Started deploy [integration/docroot@51d2552]: Add TimedMediaHandler to docroot - T338458 [06:40:32] T338458: Add a JsDoc 3 annotations and docs pipeline to TMH - https://phabricator.wikimedia.org/T338458 [06:40:39] !log hashar@deploy1002 Finished deploy [integration/docroot@51d2552]: Add TimedMediaHandler to docroot - T338458 (duration: 00m 11s) [06:43:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [06:44:18] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [06:44:49] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [07:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:23] o/ nothing to do, it seems [07:02:59] (03PS1) 10Slyngshede: C:idm:deployment notify client on Redis reconfig [puppet] - 10https://gerrit.wikimedia.org/r/931863 [07:06:03] (03PS3) 10Slyngshede: idm.wikimedia.org: Failover test [dns] - 10https://gerrit.wikimedia.org/r/931599 (https://phabricator.wikimedia.org/T338008) [07:11:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/931599 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [07:13:16] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1003.eqiad.wmnet with OS bullseye [07:14:25] (03CR) 10Slyngshede: [C: 03+2] idm.wikimedia.org: Failover test [dns] - 10https://gerrit.wikimedia.org/r/931599 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [07:18:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:21:07] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [07:21:14] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations: custom partman recipe dumpsdata100X-no-data-format.cfg causes installer to hang at partitioning menu - https://phabricator.wikimedia.org/T339929 (10ArielGlenn) Found the --no-pxe option so this host has been reimaged. Besides the partition manage... [07:21:32] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:22:46] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:23:00] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[07:23:07] (ProbeDown) resolved: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:25:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:29:06] 10SRE, 10Anti-Harassment, 10Data-Engineering, 10Traffic, and 2 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947 (10kostajh) [07:30:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:31:25] (03PS1) 10Slyngshede: Requirements: Add OIDC dependencies. [software/bitu] - 10https://gerrit.wikimedia.org/r/931865 [07:32:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:32:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931863 (owner: 10Slyngshede) [07:32:59] (03CR) 10Elukey: [V: 03+1] "Hugh, Eric - what do you think? This change is a no-op for all clusters (except a mild" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [07:33:24] not sure what to do about that ^ [07:34:30] effie: I ack it, as it should not cause immediate issues? 
[07:35:09] actually it should [07:36:53] high latency on ipv6-esams [07:37:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:37:27] should we depool? [07:37:37] let's see for a moment what the impact is [07:37:45] !incidents [07:37:46] 3812 (UNACKED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [07:37:49] !ack [07:37:49] no value provided for parameter incident and no default available [07:37:49] Incident id must be an integer [07:37:53] !ack 3812 [07:37:53] 3812 (ACKED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [07:37:55] high latency on esams on ipv6 [07:38:11] remember to ACK please, we paged 40 ppl [07:38:27] yes, because people on call didn't answer [07:38:39] ok, that's an issue then [07:39:18] so, just IPv6? 
[07:39:28] I am creating the depool patch but not using it yet [07:39:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [07:40:19] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment notify client on Redis reconfig [puppet] - 10https://gerrit.wikimedia.org/r/931863 (owner: 10Slyngshede) [07:40:52] looking at nel [07:41:07] it was my bad, my phone was on silent [07:41:36] (03PS1) 10Jcrespo: Emergency depool of esams due to ipv6 latency issues [dns] - 10https://gerrit.wikimedia.org/r/931866 [07:41:44] ^ just in case [07:41:54] fyi, network is fine between eqiad/codfw and esams https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=eqiad&var-target_site=esams&var-role=cr&var-family=All [07:41:59] I am taking incident coordinator [07:42:06] ok, thanks [07:42:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:42:20] !incidents [07:42:21] 3812 (ACKED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [07:42:41] nothing standing out in nel [07:42:58] https://docs.google.com/document/d/1gKaOn8KC2oLO0rQhwAW-fH4FEXuBHGvyDt2CpdakQ9k/edit [07:43:02] could this be the probe? 
[07:43:16] akosiaris: the probe or related to T339898 [07:43:17] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [07:43:25] (03PS1) 10Slyngshede: P:idm Redis failback to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/931868 [07:44:00] (03CR) 10Slyngshede: [C: 03+2] P:idm Redis failback to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/931868 (owner: 10Slyngshede) [07:44:26] ok, then hold our horses for a bit? [07:44:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [07:45:11] I will monitor phabricator for user reports just in case [07:45:53] jynus: happy eyeballs will hide it from users, won't it? [07:45:55] I see no issue filed [07:46:05] it looks synthetic to me.. pybal doesn't complain at all [07:46:17] I will then ask on -tech [07:46:35] but I'll know more after I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/931625 [07:47:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:47:40] (03PS1) 10Slyngshede: IDM: Failback to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/931869 (https://phabricator.wikimedia.org/T338008) [07:47:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/931869 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [07:47:56] interestingly it's not transient. 
It also started slowly increasing from ~7:05 UTC [07:48:27] please someone have a look at superset just in case [07:49:00] jynus: port 80 traffic doesn't hit superset anymore [07:49:13] ah, port 80, I missed that [07:49:27] thanks, will make that clear [07:49:29] (03CR) 10Slyngshede: [C: 03+2] IDM: Failback to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/931869 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [07:51:53] it failed before at :18 [07:52:02] it seems to be flapping [07:52:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:53:16] jynus: I can't reproduce at all [07:53:33] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:54:20] ^ what about this thing on routers, I know it is a separate dc, but could it be related? [07:54:51] no, this cr has been complaining about high cpu usage [07:55:06] as long as traffic is happy we can close the incident and keep monitoring [07:55:24] anything weird on nel? 
[07:56:54] jynus: nothing on NEL [07:57:28] so I'd say last word on vgutierrez to check ipv6 (regular) traffic looks good and we close [07:57:35] (on esams) [07:58:01] we ack the thing, open a ticket and research a possible monitoring error [07:58:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:58:12] (03PS13) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [07:58:14] (03PS9) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [07:58:14] jynus: we got a ticket related to this page already [07:58:24] I see [07:58:28] jynus: feel free to add extra information if you need [07:58:32] T339898 [07:58:32] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [07:58:51] thanks [07:59:35] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41864/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:00:36] so... 
[08:01:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41865/console" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:01:11] hi all just getting online, jynus can i get a status update, do you want me to take over IC [08:01:11] prometheus3002 reaching en.wikipedia.org:80 in esams is slower than from my house 3000km away [08:01:33] jbond: maybe because we are I think on post-incident now [08:01:42] (03CR) 10Elukey: [V: 03+1] cassandra: add initial support for PKI TLS certs to 4.x (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:02:07] jynus: ack i have it [08:03:24] vgutierrez: so is this the same issue we saw on monday? [08:04:15] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:04:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:05:05] (03PS1) 10Slyngshede: Password reset: Don't fail if a user doesn't have an email. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/931872 [08:05:32] jbond: yep [08:05:39] ack [08:05:45] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:06:33] !incidents [08:06:33] 3812 (ACKED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [08:06:39] (03Abandoned) 10Jcrespo: Emergency depool of esams due to ipv6 latency issues [dns] - 10https://gerrit.wikimedia.org/r/931866 (owner: 10Jcrespo) [08:07:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:07:15] no shit :) [08:07:28] I am gonna silence this one for a couple of hours in AM [08:07:36] akosiaris: I am on it [08:07:41] * akosiaris backs off [08:08:07] 10SRE, 10Traffic, 10Patch-For-Review: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Vgutierrez) `counterexample vgutierrez@prometheus3002:~$ curl -w @- -o /dev/null --resolve www.wikipedia.org:80:91.198.174.192 -s http://www.wikipedia.org <<'... [08:08:19] XioNoX, topranks I could use your help with https://phabricator.wikimedia.org/T339898#8951849 [08:08:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [08:09:19] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:09:43] jbond,s suggestion to close the incident, while you keep monitoring and work is moved to the ticket ? 
[08:09:53] ^also effie [08:10:15] jynus: sgtm there is no user impact for now [08:10:21] +1 [08:10:39] I just didn't see or know it was only port 80 back then [08:11:02] I just saw high latency alert [08:11:07] alert text explicitly mentions port 80 :) [08:11:14] jynus: in that case i'll put you back as the ic as i only did the last 5 mins [08:13:50] to not get fully blamed about being cautious, you should have communicated with people on call this could happen again at T339898 :-) [08:13:50] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [08:14:11] so they knew the context and were ready to respond, I definitely wasn't [08:14:59] who is talking about blame here? :) [08:15:19] I am blaming myself :-P [08:15:25] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:25] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:57] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:17:02] raising warnings about every single first time incident, especially if one has 0 end-user repercussions, isn't exactly a nice paradigm. [08:17:19] no I mean about paging [08:17:33] that it could happen, I think it did it yesterday [08:18:08] not sure I follow [08:18:09] vgutierrez: I have some errands to run soon, but having a quick look, what does time_starttransfer mean at the TCP level? 
[08:18:20] (03CR) 10Muehlenhoff: [C: 03+2] Add a type to the protocol used in a Ferm service [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:18:40] indeed it looks like it's only for the prometheus host, so I'd tend to 302 o11y [08:19:59] XioNoX: so starttransfer would be a 24s TTFB [08:20:10] weird [08:20:28] yeah, but is it the time to do the tcp handshake? or after it's established? [08:21:05] ping, nc, etc, are nominal [08:21:21] prometheus1005 has no issue btw [08:21:31] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:21:51] running the curl on all prometheus hosts [08:22:11] XioNoX: TCP handshake time gets reported on time_connect [08:22:26] appconnect should be the TLS handshake, N/A in this scenario [08:22:37] tcpdump time :) [08:22:49] it's not just 3002 btw [08:23:00] 2005 too [08:23:05] prometheus2005 that is [08:23:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:23:50] huge diff with prometheus2006. 24secs vs 0.22s [08:23:52] of course [08:23:57] as soon as I spawn tcpdump [08:24:00] it returns in 0.2s [08:24:03] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:24:19] ehm, fixed? [08:24:28] I just re-ran it and ... it's ok? [08:24:30] what the... 
[08:24:32] transient issue as mentioned before :) [08:24:35] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:24:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:24:44] heisenbugs more like it [08:25:11] the moment you try to study it, it disappears [08:26:15] (03CR) 10Elukey: [V: 03+1] "This change is finally a complete no-op for all clusters, I fixed the problem with the erb template :)" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:26:54] (03CR) 10Jbond: "lgtm assuming approvals" [puppet] - 10https://gerrit.wikimedia.org/r/931675 (https://phabricator.wikimedia.org/T339936) (owner: 10Ssingh) [08:30:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [08:31:37] (ProbeDown) resolved: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:37] (ProbeDown) resolved: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:32:07] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) OK, having talked to some folks in the Enterprise org and other teams and having eliminated a few possible problem... 
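(editor's note on the curl timing exchange above: the `curl -w` timers such as time_connect, time_appconnect and time_starttransfer are all cumulative from the start of the request, so the per-phase cost is the difference between consecutive timers. A minimal sketch of that arithmetic follows. The function name `phase_breakdown` and the sample numbers are hypothetical and are not the values pasted in T339898; they only mimic the "24s TTFB over plain HTTP" shape described in the channel.)

```python
# Sketch: split cumulative `curl -w` timers into per-phase durations.
# The sample numbers below are hypothetical, not taken from the incident.

def phase_breakdown(t):
    """t maps curl write-out timer names to cumulative seconds since start."""
    tls_done = t.get("time_appconnect", 0.0)
    # time_appconnect stays at 0 when there is no TLS handshake (plain HTTP).
    tls = tls_done - t["time_connect"] if tls_done > 0 else 0.0
    return {
        "dns": t["time_namelookup"],
        "tcp_handshake": t["time_connect"] - t["time_namelookup"],
        "tls_handshake": tls,
        # Gap between "request ready to send" and "first response byte".
        "wait_for_first_byte": t["time_starttransfer"] - t["time_pretransfer"],
        "body_transfer": t["time_total"] - t["time_starttransfer"],
    }

# Hypothetical port-80 request: fast TCP connect, very slow first byte.
sample = {
    "time_namelookup": 0.0,
    "time_connect": 0.002,
    "time_appconnect": 0.0,
    "time_pretransfer": 0.002,
    "time_starttransfer": 24.1,
    "time_total": 24.1,
}

phases = phase_breakdown(sample)
slowest = max(phases, key=phases.get)
print(slowest, round(phases[slowest], 3))
```

(with numbers shaped like this, the TCP handshake is a couple of milliseconds and nearly all of the time sits between sending the request and receiving the first byte, which is the distinction XioNoX was asking about.)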
[08:32:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [08:35:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/931865 (owner: 10Slyngshede) [08:35:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [08:35:32] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team, 10Patch-For-Review: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (10taavi) `deployment` includes `deploy-service` rights, so granting both is not necessary. [08:35:59] (03PS1) 10Jbond: pontoon: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/931876 [08:36:14] !log disable puppet on A:cp before merging Ie84c15ebe3d85c5140f8b3dc25d13086c9f34041 [08:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:31] (03CR) 10Jbond: [C: 03+1] Add a type to the protocol used in a Ferm service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:36:41] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 36 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:37:26] (03CR) 10Muehlenhoff: [C: 03+2] "Oops, good catch! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/931876 (owner: 10Jbond) [08:38:54] (03CR) 10Jbond: [C: 03+1] cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:39:37] (03CR) 10Vgutierrez: [C: 03+2] haproxy,mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [08:42:11] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 23 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:42:59] (03CR) 10Jbond: dev env: rsyslog exporter, in container env listen on all interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:44:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:48:16] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: cleanup for ns.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931590 (https://phabricator.wikimedia.org/T307357) [08:48:33] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:48:47] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931589 (https://phabricator.wikimedia.org/T307357) [08:49:03] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:49:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [08:50:33] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:51:33] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:53:27] (03CR) 10Volans: "LGTM, one leftover inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:54:13] (03PS1) 10Vgutierrez: haproxy: Capture host header in port 80 as well [puppet] - 10https://gerrit.wikimedia.org/r/931878 (https://phabricator.wikimedia.org/T339898) [08:55:16] (03CR) 10Fabfur: [C: 03+1] haproxy: Capture host header in port 80 as well [puppet] - 10https://gerrit.wikimedia.org/r/931878 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [08:56:01] (03CR) 10Volans: [C: 04-1] cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [08:57:12] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Capture host header in port 80 as well [puppet] - 10https://gerrit.wikimedia.org/r/931878 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [08:57:15] 10SRE, 10Infrastructure-Foundations: PuppetDB Netbox import script failing for cloudservices2004-dev - https://phabricator.wikimedia.org/T339953 (10jbond) 05Open→03Resolved a:03jbond I have deactivated the old node (`puppetmaster1001$ sudo puppet node deactivate cloudservices2004-dev.wikimedia.org`) and... 
[08:57:36] (03CR) 10David Caro: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/931590 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [08:58:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [08:59:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: cleanup for ns.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931590 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:00:10] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340013 (10Mickeychou) [09:02:15] (03CR) 10David Caro: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/931589 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:03:33] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340014 (10Mickeychou) [09:04:12] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340014 (10Mickeychou) map work [09:06:05] 10SRE-tools, 10Infrastructure-Foundations: Read Ganeti cluster config for cookboks from Netbox - https://phabricator.wikimedia.org/T340015 (10MoritzMuehlenhoff) [09:06:09] !log disable puppet on R:git::clone to deploy gerrit:927750 [09:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:08:30] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) https://lists.wikimedia.org/postorius/lists/glam-de.lists.wikimedia.org/ already exists, you probably need to pick another name (that mailing list is private) [09:08:35] (03CR) 10Jbond: [C: 03+2] "lgtm ill merge and give it 
a go" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [09:08:36] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) a:03Ladsgroup [09:09:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931589 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:10:03] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:10:28] (03CR) 10EoghanGaffney: [C: 03+2] apt: Add jenkins packages to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/931600 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [09:13:37] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:14:32] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [09:15:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test sync - jbond@cumin1001" [09:16:24] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [09:17:03] !log jbond@cumin1001 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "test sync - jbond@cumin1001" [09:17:06] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [09:17:28] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) From members table, I explicitly removed you from 
stewards and ops but it had one per mailing list. Is it any better? [09:18:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test sync - jbond@cumin1001" [09:18:45] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:19:12] (03PS1) 10Slyngshede: C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 [09:19:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:19:36] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) [09:20:20] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [09:20:41] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:20:50] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) (duration: 01m 14s) [09:21:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test sync - jbond@cumin1001" [09:22:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:23:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [09:23:32] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 
(https://phabricator.wikimedia.org/T339894) [09:25:00] (03PS1) 10Vgutierrez: haproxy: Add X-Cache-Status for port 80 responses [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) [09:25:40] (03CR) 10Vgutierrez: "for reviewing purposes http-response set-header is documented here: https://docs.haproxy.org/2.6/configuration.html#4.2-http-response%20se" [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [09:26:09] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:27:37] (03PS17) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [09:27:54] (03CR) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [09:28:17] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:28:30] (Processor usage over 85%) firing: (2) Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [09:28:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:25] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10hoo) >>! 
In T339341#8952050, @Ladsgroup wrote: > From members table, I explicitly removed you from stewards and ops but it had one per mailing list. Is it any better?... [09:30:59] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to test it with the test-cookbook new command ;) both in dry-run and not if needed." [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [09:31:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:13] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/931880/41867/" [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [09:32:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:30] (Processor usage over 85%) firing: (2) Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [09:33:37] PROBLEM - DPKG on releases2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:37:25] PROBLEM - DPKG on releases1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:38:53] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[09:41:06] (03PS2) 10Vgutierrez: haproxy: Add self-id headers for port 80 responses [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) [09:42:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41868/console" [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [09:42:44] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [09:48:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [09:49:25] (03CR) 10Stevemunene: [C: 03+2] Fix datahub connections [puppet] - 10https://gerrit.wikimedia.org/r/931690 (https://phabricator.wikimedia.org/T333004) (owner: 10Aqu) [09:49:55] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:50:27] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:51:21] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 43 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:53:21] (03PS3) 10Vgutierrez: haproxy: Add self-id headers for port 80 responses [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) [09:53:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:09] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340014 (10Reedy) [09:57:13] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340013 (10Reedy) [09:58:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:59] !log dcausse@deploy1002 Started deploy [airflow-dags/search@9c03845]: search: schedule cirrus_consistency_check [09:59:17] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@9c03845]: search: schedule cirrus_consistency_check (duration: 00m 18s) [09:59:26] (03CR) 10Effie Mouzeli: "I like this idea, I would suggest we consider making this action slightly more human readable, potentially something like" [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T1000) [10:00:55] (03PS1) 10Jbond: git::clone: use = not == [puppet] - 10https://gerrit.wikimedia.org/r/931882 (https://phabricator.wikimedia.org/T290260) [10:01:41] (03CR) 10Jbond: [C: 03+2] git::clone: use = not == [puppet] - 10https://gerrit.wikimedia.org/r/931882 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [10:02:23] (03PS4) 10Vgutierrez: haproxy: Add self-id headers for port 80 responses [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) [10:03:06] (03CR) 10Vgutierrez: "http-after-response (https://docs.haproxy.org/2.6/configuration.html#http-after-response) is needed here to be able to add 
headers to an i" [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [10:03:58] (03CR) 10Vgutierrez: haproxy: Add self-id headers for port 80 responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [10:04:09] RECOVERY - DPKG on releases2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:45] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) Hey Amir! hmm. Do we know if it's still an active group? Is it worth trying to reclaim it..? Or better to just come up with another name...? [10:05:31] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T340014 (10Aklapper) 05Open→03Invalid @Mickeychou: Do not create empty tickets but fill out ALL required information always. Thanks. [10:07:20] (03CR) 10Fabfur: [C: 03+1] "Looks fine!" [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [10:07:57] RECOVERY - DPKG on releases1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:13:57] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) Hey! Nice to talk to you ^_^ It is not active and in theory we could take ownership of it and reclaim it but it is a private mailing list and we can't make the mailing list public without t... [10:15:21] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Add self-id headers for port 80 responses [puppet] - 10https://gerrit.wikimedia.org/r/931881 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [10:19:49] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) 05Open→03Resolved Awesome. 
I close it, let me know if you have trouble getting email from stewards or ops. [10:24:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db[1124,1133].eqiad.wmnet with reason: Testing cloning [10:24:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[1124,1133].eqiad.wmnet with reason: Testing cloning [10:28:04] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/931884 [10:29:09] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:30:40] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/931884 (owner: 10Volans) [10:30:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:53] (03PS1) 10Jbond: git::clone: add support for alternate remotes [puppet] - 10https://gerrit.wikimedia.org/r/931887 (https://phabricator.wikimedia.org/T290260) [10:35:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:35:41] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 35 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:35:45] (03CR) 10FNegri: [V: 03+1] cumin: 
Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [10:36:02] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) do we know if there are any archives? [10:36:29] (03PS1) 10Volans: Upstream release v7.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/931888 [10:37:07] !log dcausse@deploy1002 Started deploy [airflow-dags/search@29d9615]: search: schedule cirrus_consistency_check (take 2) [10:37:18] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@29d9615]: search: schedule cirrus_consistency_check (take 2) (duration: 00m 10s) [10:39:29] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) Yeah, there are actually a lot https://lists.wikimedia.org/hyperkitty/list/glam-de@lists.wikimedia.org/latest?count=200 Holger should have access to them. 
[10:39:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:41:49] (03PS1) 10Muehlenhoff: Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) [10:42:29] (03CR) 10Volans: [C: 03+2] Upstream release v7.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/931888 (owner: 10Volans) [10:43:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [10:44:19] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 59 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:44:25] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:44:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931887 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [10:46:33] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:47:10] (03Merged) 10jenkins-bot: Upstream release v7.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/931888 (owner: 10Volans) [10:49:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:51:17] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:51:53] !log uploaded spicerack_7.2.1 to apt.wikimedia.org bullseye-wikimedia [10:51:55] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:07] (03PS1) 10Vgutierrez: cache: Capture X-Cache-Status on port 80 frontend [puppet] - 10https://gerrit.wikimedia.org/r/931891 (https://phabricator.wikimedia.org/T339898) [10:52:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:53:35] (03CR) 10Vgutierrez: [C: 03+2] cache: Capture X-Cache-Status on port 80 frontend [puppet] - 10https://gerrit.wikimedia.org/r/931891 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [10:53:45] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:55:27] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:55:57] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:56:58] (03CR) 10Jbond: [C: 03+2] git::clone: add support for alternate remotes [puppet] - 10https://gerrit.wikimedia.org/r/931887 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [10:57:04] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) [10:57:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:57:53] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) (duration: 00m 48s) [10:58:38] !log 
re-enable puppet in A:cp - T339898 [10:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [10:58:46] (03PS1) 10ArielGlenn: correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) [10:59:13] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) Ah - got it. I also have access once I subscribed and made an account. Thanks Amir. Looks like there's no way to use this list for the purpose we had in mind. I'll have to come up... [10:59:19] (03CR) 10Nikerabbit: [C: 04-1] TranslationNotifications: Run UnsubscribeInactiveUsers periodically (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) (owner: 10Abijeet Patro) [11:02:14] !log installing python2.7 security updates [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:04:01] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:12:27] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:16:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:25:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930845 [11:25:23] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930845 (owner: 10TrainBranchBot) [11:26:05] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 34 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:26:41] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 
(10SLyngshede-WMF) I'm having a bit of a hard time parsing the PHP code... mostly because PHP3 was the last version I use... [11:31:29] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10Reedy) [11:31:57] (03PS1) 10Jelto: traficserver: switch miscweb transparency backend [puppet] - 10https://gerrit.wikimedia.org/r/931894 (https://phabricator.wikimedia.org/T338781) [11:32:30] (03PS1) 10Jbond: git::clone: Add types docs and minor updates [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) [11:32:42] (03CR) 10CI reject: [V: 04-1] git::clone: Add types docs and minor updates [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [11:34:29] (03PS1) 10Hnowlan: swift: add logging for when private connections are used [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/931896 (https://phabricator.wikimedia.org/T338765) [11:34:36] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) >>! In T339917#8952350, @SLyngshede-WMF wrote: > I'm having a bit of a hard time parsing the PHP code... mostly... [11:37:38] (03PS1) 10Jelto: httpbb: move tests for transparency.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/931897 (https://phabricator.wikimedia.org/T338781) [11:38:37] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10Reedy) https://noc.wikimedia.org/conf/highlight.php?file=wikitech.php `lang=php $wgLDAPPreferences = [ 'labs' => [ "e... 
[11:38:45] (03PS2) 10ArielGlenn: correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) [11:39:36] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add hostname with port [deployment-charts] - 10https://gerrit.wikimedia.org/r/931281 (owner: 10Hnowlan) [11:39:45] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [11:39:47] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [11:39:52] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [11:39:55] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [11:40:12] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [11:40:16] (03CR) 10CI reject: [V: 04-1] swift: add logging for when private connections are used [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/931896 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [11:40:26] (03Merged) 10jenkins-bot: rest-gateway: add hostname with port [deployment-charts] - 10https://gerrit.wikimedia.org/r/931281 (owner: 10Hnowlan) [11:40:29] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10SLyngshede-WMF) @taavi Thank you :-) It is kinda fun poke around in though, but I think I need to have a debug/test e... 
[11:40:30] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [11:40:43] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [11:40:59] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [11:42:39] (03PS1) 10Jelto: miscweb: move probes for transparency to profile::microsites::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/931898 (https://phabricator.wikimedia.org/T338781) [11:45:48] (03PS2) 10Jelto: miscweb: move probes for transparency to profile::microsites::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/931898 (https://phabricator.wikimedia.org/T338781) [11:46:03] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:46:27] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: relocate some LDAP hiera [puppet] - 10https://gerrit.wikimedia.org/r/931899 [11:46:41] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:47:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:50:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930845 (owner: 10TrainBranchBot) [11:51:58] (03PS2) 10Jbond: git::clone: Add types docs and minor updates [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) [11:52:10] (03CR) 10CI reject: [V: 04-1] git::clone: Add types docs and minor updates [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 
10Jbond) [11:52:12] (03PS1) 10Arturo Borrero Gonzalez: cloud: drop references to cloudservices2005-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/931900 (https://phabricator.wikimedia.org/T338779) [11:52:47] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 56 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:53:06] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [11:53:21] (03PS3) 10Jbond: git::clone: Add types docs and minor updates [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) [11:56:06] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: relocate some LDAP hiera [puppet] - 10https://gerrit.wikimedia.org/r/931899 [11:56:08] (03PS2) 10Arturo Borrero Gonzalez: cloud: drop references to cloudservices2005-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/931900 (https://phabricator.wikimedia.org/T338779) [11:57:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:57:41] (03CR) 10Jbond: "lgtm but suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:57:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/output/931899/41875/" [puppet] - 10https://gerrit.wikimedia.org/r/931899 (owner: 10Arturo Borrero Gonzalez) [11:58:48] (03CR) 10Jelto: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 
(https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [12:00:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: drop references to cloudservices2005-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/931900 (https://phabricator.wikimedia.org/T338779) (owner: 10Arturo Borrero Gonzalez) [12:01:16] (03PS3) 10ArielGlenn: correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) [12:01:38] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudservices2005-dev [12:02:02] (03CR) 10Muehlenhoff: Add missing types to ferm::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:02:09] 10SRE-tools, 10Infrastructure-Foundations: Read Ganeti cluster config for cookbooks from Netbox - https://phabricator.wikimedia.org/T340015 (10Reedy) [12:03:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:03:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:03:20] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test change - jbond@cumin1001" [12:03:53] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 28 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:05:45] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:06:40] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "test change - 
jbond@cumin1001" [12:06:43] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test change - jbond@cumin1001" [12:06:47] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:08:11] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:11:54] (03CR) 10Jbond: "Unfortunately the PCC is a bit noisy as this does make some minor changes to the exec resources specifically:" [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [12:12:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test change - jbond@cumin1001" [12:12:22] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [12:12:26] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [12:12:30] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:12:55] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:13:51] (03CR) 10Jbond: [C: 03+1] "sgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:14:02] (03PS1) 10Muehlenhoff: Use sprintf() to build the config file [puppet] - 
10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) [12:14:04] (03PS1) 10Muehlenhoff: Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) [12:14:26] (03CR) 10Elukey: [V: 03+1] cassandra: add initial support for PKI TLS certs to 4.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:14:45] (03CR) 10CI reject: [V: 04-1] Use sprintf() to build the config file [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:14:49] (03CR) 10CI reject: [V: 04-1] Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:15:58] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:16:00] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [12:16:02] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [12:16:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [12:16:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [12:16:27] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10MoritzMuehlenhoff) [12:16:33] 10SRE, 10Infrastructure-Foundations: icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This got added by Simon in https://gerrit.wikimedia.org/r/c/operations/puppet/+/825728/ and was tested in h... [12:17:33] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:17:41] (03PS2) 10Muehlenhoff: Use sprintf() to build the config file [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) [12:18:03] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:18:07] (03CR) 10CI reject: [V: 04-1] Use sprintf() to build the config file [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:18:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: OpenConfirm - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:19:44] (03PS3) 10Muehlenhoff: Use sprintf() to build the config file [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) [12:21:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:23:33] (03PS2) 10Jbond: Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:24:11] (03CR) 10CI reject: [V: 04-1] Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:24:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:30:16] (03PS14) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [12:30:18] (03PS10) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [12:30:20] (03PS1) 10Elukey: role::ml_cache::storage: use pki truststore [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) [12:30:39] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [12:31:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41877/console" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:33:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41878/console" [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:34:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: OpenSent - Telia, AS1299/IPv6: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:08] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41879/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:35:12] (03PS3) 10Muehlenhoff: Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) [12:35:24] (03CR) 10Elukey: [V: 03+1] "Ok folks sorry I know that you probably hate me, but I added an extra bit to change the truststore before rolling out the pki keystores. I" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:35:28] (03CR) 10Muehlenhoff: Switch all uses priority of ferm::service to numeric values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:37:53] (03CR) 10Elukey: [V: 03+1] "In this case the change is to all cluster, but we can safely disable puppet and work on a single node first." 
[puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:38:21] (03PS4) 10ArielGlenn: correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) [12:38:43] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:38:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:15] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:39:56] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test change - jbond@cumin1001" [12:41:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test change - jbond@cumin1001" [12:41:51] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test change - jbond@cumin1001" [12:42:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test change - jbond@cumin1001" [12:45:57] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:46:06] (03PS3) 10Elukey: role::cache::{text,upload}: move vk codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) [12:46:27] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:46:45] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Requirements: Add OIDC dependencies. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/931865 (owner: 10Slyngshede) [12:47:31] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41880/console" [puppet] - 10https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [12:47:37] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [12:47:37] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:47:38] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices2005-dev [12:49:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:49:50] (03PS1) 10Muehlenhoff: sre.ganeti.drain-node: Fix arg parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/931905 [12:50:32] (03CR) 10Vgutierrez: [C: 03+1] role::cache::{text,upload}: move vk codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [12:50:57] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::cache::{text,upload}: move vk codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [12:51:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 773 (alerts on 35) - 
https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:51:42] !log move varnishafka instances in codfw to PKI [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:03] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:54:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:54:16] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:57:03] (03CR) 10Hashar: [C: 03+1] "I went through some of the PCC entries and that looks good to me. 
The updated code looks a bit more pleasant (though there are some dark" [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [12:57:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:42] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:57:46] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.drain-node: Fix arg parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/931905 (owner: 10Muehlenhoff) [12:58:56] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:58] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) [12:59:47] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 56 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T1300) [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:00:45] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:45] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:59] (03CR) 10Klausman: [C: 03+1] role::ml_cache::storage: use pki truststore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:02:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [13:02:04] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [13:02:18] (03CR) 10Klausman: [C: 03+1] role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:04:56] (03CR) 10Elukey: [C: 03+1] charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [13:05:29] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:14] (03PS2) 10Elukey: role::ml_cache::storage: use pki truststore [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) [13:06:20] (03PS1) 10Muehlenhoff: Fix test to also allow ganeti-test servers (using a slightly different role) [cookbooks] - 10https://gerrit.wikimedia.org/r/931928 [13:06:22] (03CR) 10Elukey: role::ml_cache::storage: use pki truststore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:06:25] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:32] (03PS11) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [13:07:01] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:31] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:09:55] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) [13:11:31] (03CR) 10Volans: [C: 03+1] "LGTM, FYI the test-cookbook script is there to allow to check those small details before merging ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/931928 (owner: 10Muehlenhoff) [13:12:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [13:12:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [13:13:15] (03PS1) 10Jbond: netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) [13:13:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:15:05] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10serviceops: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [13:15:43] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10serviceops: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) p:05Triage→03High [13:15:57] (03PS2) 10Slyngshede: Password reset: Don't fail if a user doesn't have an email. [software/bitu] - 10https://gerrit.wikimedia.org/r/931872 [13:16:01] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [13:16:34] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Password reset: Don't fail if a user doesn't have an email. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/931872 (owner: 10Slyngshede) [13:16:53] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10serviceops: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [13:17:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1024 [13:17:53] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10serviceops: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [13:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:18:16] (03CR) 10Muehlenhoff: [C: 03+2] Fix test to also allow ganeti-test servers (using a slightly different role) [cookbooks] - 10https://gerrit.wikimedia.org/r/931928 (owner: 10Muehlenhoff) [13:18:28] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [13:18:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [13:18:37] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1025 [13:19:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1025 [13:19:41] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [13:21:07] (03CR) 
10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41882/console" [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:21:46] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930601 (owner: 10PipelineBot) [13:22:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [13:22:06] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1025.mgmt.eqiad.wmnet with reboot policy FORCED [13:23:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [13:23:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [13:25:25] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) You can see Box Plots and Violin Plots of the per-country latency data here: ` stat1009.e... 
[13:26:31] (03CR) 10Jameel Kaisar: [C: 03+1] Update mappings for some countries based on initial Probenet data (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [13:26:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1025.mgmt.eqiad.wmnet with reboot policy FORCED [13:26:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [13:27:44] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:28:06] (03PS4) 10FNegri: cumin: Increase connect_timeout for slow servers [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [13:28:15] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:24] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:28:48] (03CR) 10Volans: "I'm not that familiar with the IDM setup but it seems to do the right thing." [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [13:28:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:01] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [13:30:09] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [13:30:26] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [13:30:59] (03CR) 10FNegri: "I updated the patch to fix the issue." 
[puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:31:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1024 [13:33:48] (03CR) 10Volans: [C: 04-1] netbox: expand device schema to include optional platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:36:11] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:36:16] (03CR) 10Volans: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:36:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:36:25] (03CR) 10Muehlenhoff: [C: 03+2] Fix test to also allow ganeti-test servers (using a slightly different role) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/931928 (owner: 10Muehlenhoff) [13:36:51] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [13:36:51] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [13:36:53] (03CR) 10Ayounsi: [C: 04-1] "We don't use the platform field see https://phabricator.wikimedia.org/T336623" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:36:57] (03PS2) 10Jbond: netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) [13:37:23] (03CR) 10Jbond: "fixed thanks" [puppet] - 
10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:37:28] (03CR) 10Volans: [C: 03+1] Fix test to also allow ganeti-test servers (using a slightly different role) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/931928 (owner: 10Muehlenhoff) [13:38:23] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:29] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) So, we need something to identify those users. wikiwand, if I understand the usage of the MCS en... [13:39:11] !log installed spicerack 7.2.1 to the cumin/cloudcumin hosts [13:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:57] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:41:07] (03PS3) 10Jbond: netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) [13:41:36] (03CR) 10FNegri: "Additional note: we don't need this increased timeout in cloud-cumin-03 because it is not using the bastion, and works fine with a smaller" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:42:02] (03CR) 10FNegri: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 
(https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:43:07] !log robh@cumin1001 START - Cookbook sre.dns.netbox [13:43:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [13:43:36] (03CR) 10Jbond: "change to use manufacturer and a simple string (as we have lots of manufactures)" [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:43:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1025 [13:43:59] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:44:39] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:44:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1025 [13:45:08] (03PS1) 10Slyngshede: P:idp:services Add netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/931937 [13:45:14] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1024 - robh@cumin1001" [13:45:45] (03CR) 10FNegri: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:45:59] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1024 - robh@cumin1001" [13:45:59] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:49] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [13:47:09] RECOVERY - 
OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:47:32] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Clement_Goubert) Release has been cut, cookbook works ok now. @Volans do you want to keep the task open to keep looking for a way to enforce... [13:47:46] (03PS41) 10Slyngshede: P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [13:47:47] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:47:56] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1024 [13:48:44] (03PS42) 10Slyngshede: P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [13:49:11] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Volans) I was about to come here and comment as I just deployed spicerack to production :) I think we can keep it open for now to see if we... [13:49:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1025.eqiad.wmnet with OS bullseye [13:49:58] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye [13:50:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41883/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [13:50:31] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1024.eqiad.wmnet with OS bullseye [13:50:38] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye [13:52:54] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) >>! In T340036#8952804, @akosiaris wrote: > So, we need something to identify those users. wikiwan... 
[13:53:06] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) [13:53:27] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:53:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [13:53:47] (03CR) 10JHathaway: dev env: rsyslog exporter, in container env listen on all interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:54:05] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:54:13] (03CR) 10Ayounsi: [C: 03+1] sre.puppet.sync-netbox-hiera: Add platform (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:54:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:01] (03CR) 10JHathaway: [C: 03+2] sshd: don't add AuthorizedKeysFile when we have no keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:55:25] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) >>! In T340036#8952836, @MSantos wrote: > Sounds great to me, no objections. Cool. Do we have K... 
[13:56:12] (03CR) 10Slyngshede: [V: 03+1] P:netbox reconfigure to used OIDC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [13:56:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:43] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) [13:57:00] (03CR) 10Jbond: "cheers" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:57:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:58:14] effie: ^ [13:58:41] wrong channel, disregard [13:58:48] but this is a page nonetheless [13:59:34] (03PS1) 10Ilias Sarantopoulos: ores: enable liftwing on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) [13:59:50] i see a small peak of tcp resets but already seems to be going down [13:59:58] It's going to be resolved soon, small peak, subsiding already [14:00:10] ahh ok its none thanks [14:00:15] something in the US? [14:00:20] (03CR) 10Volans: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [14:00:35] yes seems to be [14:01:09] specifically text-lb.codfw.wikimedia.org. 
[14:01:51] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [14:02:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:02:30] (03CR) 10Ladsgroup: "The changes to ext-ORES files is not related and should be discarded" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:02:32] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage [14:03:08] !incidents [14:03:09] 3819 (RESOLVED) NELHigh sre (tcp.timed_out) [14:03:09] 3812 (RESOLVED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [14:03:19] (03PS2) 10Ilias Sarantopoulos: ores: enable liftwing on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) [14:04:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [14:05:00] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T339884 (10Jhancock.wm) this is a known issue. 
will resolve when able [14:06:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:06:31] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:06:46] hi [14:07:35] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:17] (03PS1) 10Jcrespo: dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) [14:08:39] (03CR) 10CI reject: [V: 04-1] dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [14:09:45] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:09:58] (03PS2) 10Jcrespo: dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) [14:11:22] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:11:22] (03CR) 10Ladsgroup: [C: 03+2] ores: enable liftwing on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:11:27] !log 
bking@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:12:04] (03CR) 10CI reject: [V: 04-1] dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [14:12:14] (03CR) 10JHathaway: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [14:12:32] (03CR) 10Ssingh: [C: 03+1] site: add buster people VMs to insetup role for decom [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [14:12:36] (03Merged) 10jenkins-bot: ores: enable liftwing on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:12:53] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:13:40] (03CR) 10Ilias Sarantopoulos: ores: enable liftwing on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931939 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:15:18] (03PS3) 10Jcrespo: dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) [14:16:01] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:16:34] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10akosiaris) Implementing this is rather easy, but @TheDJ has a point. https://wikitech.wikimedia.org/wiki/Maps/External_usage says //maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites h... 
[14:16:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:17:24] 10SRE, 10ops-codfw: asw-d2-codfw:PEM0 down for 2 weeks - https://phabricator.wikimedia.org/T340002 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power cord reseated firmly. confirmed click on the PDU. both PSUs have a green light. [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:49] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [14:19:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:47] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002" [14:20:27] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2005-dev - aborrero@cumin2002" [14:20:27] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:33] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:20:58] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2005-dev.private.codfw.wikimedia.cloud on all recursors [14:21:01] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
cloudservices2005-dev.private.codfw.wikimedia.cloud on all recursors [14:21:01] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:21:11] (03PS4) 10Jcrespo: dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) [14:22:34] (03PS2) 10Hnowlan: swift: add logging for when private connections are used [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/931896 (https://phabricator.wikimedia.org/T338765) [14:22:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:36] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [14:23:36] (03Abandoned) 10JHathaway: dev env: have ssh server use the dev environment ssh configs [puppet] - 10https://gerrit.wikimedia.org/r/928664 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:24:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:44] (03PS5) 10Jcrespo: dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) [14:25:33] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:10] (03CR) 10Volans: sre.puppet.sync-netbox-hiera: Add platform (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [14:27:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931633 
(owner: 10Muehlenhoff) [14:28:27] (03PS2) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [14:29:23] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [14:29:24] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1024.eqiad.wmnet with OS bullseye [14:29:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye completed: - dbproxy1024 (**... [14:29:41] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:29:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) [14:30:53] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Make mydumper backup template path configurable [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [14:30:57] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/931940/41887/backup2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/931940 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [14:31:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) dbproxy1024's network settings were a bit off, so rather than try to figure out why, i just dumped the primary interface out and re-ran the netbox network... 
[14:31:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:32:37] (03CR) 10Jbond: [C: 03+1] "make senses and lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:33:51] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:33:51] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [14:36:03] (03CR) 10Jbond: [C: 03+1] Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:37:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [14:39:06] (03CR) 10Ottomata: [C: 03+2] charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [14:39:42] (03PS1) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [14:40:09] (03CR) 10Jbond: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [14:40:46] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:40:46] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [14:42:23] (03Merged) 10jenkins-bot: charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - 
10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [14:42:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [14:42:55] (03PS4) 10Hnowlan: trafficserver: route proton requests via the API gateway [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) [14:43:04] (03PS1) 10Vgutierrez: haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) [14:43:08] (03PS9) 10Ottomata: eventgate-* - use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:44:04] (03PS2) 10Vgutierrez: haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) [14:44:10] (03CR) 10CI reject: [V: 04-1] eventgate-* - use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:44:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:44:41] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:45:33] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41889/console" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:46:11] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:11] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:46] (03CR) 10Dzahn: [C: 03+2] httpbb: move tests for transparency.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/931897 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:47:09] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:47:09] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [14:47:48] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:47:56] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [14:48:37] (03CR) 10Dzahn: [C: 03+2] traficserver: switch miscweb transparency backend [puppet] - 10https://gerrit.wikimedia.org/r/931894 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:49:11] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [14:49:20] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:49:21] 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10User-aborrero: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) 05Open→03In progress Hey @Papaul (or @Jhancock.wm ) we would need this server connected in a similar fas... 
[14:49:59] (03CR) 10Dzahn: [C: 03+2] miscweb: move probes for transparency to profile::microsites::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/931898 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:50:00] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [14:50:17] (03PS3) 10Vgutierrez: haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) [14:50:19] (03PS1) 10Vgutierrez: hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) [14:50:51] (03PS5) 10ArielGlenn: correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) [14:51:35] (03CR) 10ArielGlenn: [C: 03+2] correct and expand docs for dumps nfs share testing [puppet] - 10https://gerrit.wikimedia.org/r/931893 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:51:37] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41890/console" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:53:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:53:07] (03PS4) 10Vgutierrez: haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) [14:53:09] (03PS2) 10Vgutierrez: hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) [14:53:11] (03PS1) 10Hashar: beta: avoid erasing extensions when already present [puppet] - 
10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) [14:53:13] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:53:20] (03PS2) 10Hashar: beta: avoid erasing extensions when already present [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) [14:54:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41891/console" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:54:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931937 (owner: 10Slyngshede) [14:55:36] (03CR) 10Hashar: "Details at T340030#8953010" [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [14:56:11] (03PS2) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [14:56:35] (03CR) 10Dzahn: [C: 03+2] "merged, ran puppet on cp4*, opened the URLs, confirmed no more log entries on miscweb1003. looks good to me:) thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931894 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:56:59] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T339884 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:57:09] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41892/console" [puppet] - 10https://gerrit.wikimedia.org/r/931945 (owner: 10Ssingh) [14:57:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41893/console" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:57:49] (03PS10) 10Ottomata: eventgate-* - use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:58:36] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:00:01] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:01] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:32] (03PS5) 10Vgutierrez: haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) [15:01:34] (03PS3) 10Vgutierrez: hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) [15:03:07] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) [15:03:36] !log depooling 
sessionstore/codfw — T340043 [15:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] T340043: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 [15:03:46] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [15:04:19] (03CR) 10BCornwall: [C: 03+1] haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:04:51] (03PS2) 10Jcrespo: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:06:45] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [15:07:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:07:16] (03PS3) 10Jcrespo: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:08:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: maintenance [15:09:09] !log installing joblib security updates [15:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:30] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) [15:09:38] (03CR) 10Hnowlan: [C: 03+2] swift: add logging for when private connections are used [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/931896 
(https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [15:10:06] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41894/console" [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:10:21] (03CR) 10Vgutierrez: [C: 03+1] "looks good, please merge it with puppet disabled on A:cp and double check that everything works as expected on a depooled host" [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [15:10:23] (03PS4) 10Jcrespo: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:10:41] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:10:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:51] (03CR) 10Ssingh: [C: 03+1] haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:13:08] (03PS5) 10Jcrespo: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:13:44] 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10User-aborrero: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10Jhancock.wm) @aborrero I've physically removed the connection from asw. the cloudsw connection is now on port ge-0/0/1... 
[15:13:48] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Let set port 80 timeouts via hiera [puppet] - 10https://gerrit.wikimedia.org/r/931947 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:13:54] (03Merged) 10jenkins-bot: swift: add logging for when private connections are used [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/931896 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [15:13:59] (03PS3) 10Hashar: beta: avoid erasing extensions when already present [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) [15:14:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:14:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [15:14:51] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [15:15:07] (03CR) 10JHathaway: [C: 03+2] dev env: rsyslog exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:15:12] (03CR) 10Jbond: [C: 03+1] sshd: don't add AuthorizedKeysFile when we have no keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:15:51] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/931948/41896/cp3050.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:16:34] (03CR) 10Ssingh: [C: 03+1] hiera: Set stricter timeouts for port 80 on esams 
[puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:16:43] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:16:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:17:46] (03PS4) 10Vgutierrez: hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) [15:18:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad [15:19:25] !log installing php7.3 security updates [15:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:31] (03CR) 10Ssingh: [C: 03+1] hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:20:01] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiff to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10ayounsi) [15:20:11] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10ayounsi) [15:20:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad [15:20:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:20:32] what's uppp [15:20:55] ACKed, looking [15:20:57] again? [15:20:58] (03CR) 10Hashar: "I have cherry picked it to the puppet master, had to fix [ which was missing the path and replaced it with /usr/bin/test in PS3." 
[puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:21:01] maybe we should depool codfw? [15:21:31] 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10User-aborrero: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) >>! In T338779#8953114, @Jhancock.wm wrote: > @aborrero I've physically removed the connection from asw. the... [15:22:07] +! [15:22:23] vgutierrez: codfw gets a pretty small share of traffic, and the NEL alert pretty directly measures user impact, so I agree [15:22:39] (03PS1) 10Ssingh: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/931952 [15:23:03] (03CR) 10Vgutierrez: [C: 03+1] depool codfw [dns] - 10https://gerrit.wikimedia.org/r/931952 (owner: 10Ssingh) [15:23:10] sukhe: thx <3 [15:23:12] I have not yet started my upgrade, should I hold off? [15:23:12] shall I stop the rolling reboots of cp in eqiad? [15:23:30] brett: I think it's best to stop if you can yes [15:23:38] (03CR) 10Ssingh: [C: 03+2] depool codfw [dns] - 10https://gerrit.wikimedia.org/r/931952 (owner: 10Ssingh) [15:23:39] sessionstore/codfw is depooled, but I haven't yet proceeded otherwise [15:23:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_eqiad [15:23:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-upload_eqiad [15:23:59] (03CR) 10Ahmon Dancy: [C: 03+1] beta: avoid erasing extensions when already present [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:24:04] !log run authdns-update to depool cofw [15:24:05] urandom: not related IMHO :) [15:24:06] !log run authdns-update to depool codfw [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:07] cp1075/1076 are still in the process of rebooting but should be
up soon [15:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:18] brett: thanks [15:24:39] brett: if you cancelled the cookbook in the middle of that reboot you'd need to pool them back manually I guess [15:24:50] d'oh, you're right [15:24:55] one cp in each cluster should be fine, I would just continue [15:25:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:25:32] sure, but this is not critical work fwiw [15:25:41] (the reboots) [15:25:54] just essential work(TM) [15:25:59] !incident [15:26:00] !incidents [15:26:01] 3820 (RESOLVED) NELHigh sre (tcp.timed_out) [15:26:01] 3819 (RESOLVED) NELHigh sre (tcp.timed_out) [15:26:01] 3812 (RESOLVED) ProbeDown sre (2620:0:862:ed1a::2 ip6 text:80 probes/service http_text_ip6 esams) [15:26:33] (03CR) 10Hashar: [C: 03+1] beta: avoid erasing extensions when already present [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:27:03] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [15:27:10] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore2001.codfw.wmnet with OS bullseye [15:29:01] 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10User-aborrero: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10Jhancock.wm) [15:29:35] 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10User-aborrero: cloudservices2005-dev: reimage into new network setup - 
https://phabricator.wikimedia.org/T338779 (10Jhancock.wm) updated! [15:33:12] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set stricter timeouts for port 80 on esams [puppet] - 10https://gerrit.wikimedia.org/r/931948 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:33:17] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:35] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 260.91 ms [15:34:56] (03CR) 10Ahmon Dancy: git::clone: Add types docs and minor updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [15:35:26] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/931949 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:35:41] 10SRE-tools, 10DBA, 10Infrastructure-Foundations: Create a cookbook for cloning a mariadb database into another - https://phabricator.wikimedia.org/T340048 (10Ladsgroup) [15:35:56] 10SRE-tools, 10DBA, 10Infrastructure-Foundations: Create a cookbook for cloning a mariadb database into another - https://phabricator.wikimedia.org/T340048 (10Ladsgroup) p:05Triage→03Medium [15:36:03] (03PS3) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [15:36:45] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:36:47] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:37:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:15] (03PS1) 10Ahmon Dancy: git::clone: Fix -B parameter [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) 
[15:39:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [15:39:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:33] PROBLEM - SSH on cloudbackup2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:39:45] (03PS1) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [15:40:21] (03CR) 10Ahmon Dancy: git::clone: Add types docs and minor updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931895 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [15:40:23] (03CR) 10CI reject: [V: 04-1] git::clone: Fix -B parameter [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:40:27] (03PS1) 10Jbond: git::clone: fix branch selection [puppet] - 10https://gerrit.wikimedia.org/r/931962 [15:41:33] (03PS1) 10Hnowlan: thumbor: bump chart, swift private debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/931963 (https://phabricator.wikimedia.org/T338765) [15:42:31] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10MoritzMuehlenhoff) I'll look into dictdiffer [15:42:35] (03CR) 10CI reject: [V: 04-1] git::clone: fix branch selection [puppet] - 10https://gerrit.wikimedia.org/r/931962 (owner: 10Jbond) [15:42:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=cdn [15:42:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [15:42:45] (03CR) 10CI reject: [V: 04-1] mysql: Introduce 
sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [15:42:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=cdn [15:42:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=ats-be [15:43:02] (03PS2) 10Jbond: git::clone: Fix -B parameter [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:43:19] (03CR) 10Jbond: "thanks i just sent a fix to the spec test will merge when CI is green" [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:43:27] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump chart, swift private debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/931963 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [15:43:49] PROBLEM - Host cloudbackup2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:10] (03Merged) 10jenkins-bot: thumbor: bump chart, swift private debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/931963 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [15:44:19] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@d9a9135]: (no justification provided) [15:44:29] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@d9a9135]: (no justification provided) (duration: 00m 09s) [15:45:27] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2028.* [15:45:29] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:45:40] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:46:10] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3053.* [15:47:27] (03PS5) 10AOkoth: vrts: post decom cleanup [puppet] - 
10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) [15:47:39] (03PS4) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [15:47:57] was anyone working on cloudbackup2001? [15:48:01] (03CR) 10AOkoth: vrts: post decom cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [15:48:09] host is down [15:49:25] RECOVERY - Host cloudbackup2001 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:49:30] (03PS2) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [15:49:34] ok [15:51:20] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [15:51:47] RECOVERY - SSH on cloudbackup2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:52:22] (03CR) 10Ahmon Dancy: [C: 03+1] git::clone: Fix -B parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:52:24] (03CR) 10CI reject: [V: 04-1] mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [15:53:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:31] (03CR) 10Jbond: [C: 03+2] git::clone: Fix -B parameter [puppet] - 10https://gerrit.wikimedia.org/r/931960 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:55:21] PROBLEM - 
OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:25] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:39] (03PS1) 10Ilias Sarantopoulos: ml-services: enabled quantization of llms [deployment-charts] - 10https://gerrit.wikimedia.org/r/931965 [15:55:43] (03PS3) 10BCornwall: pybal: Fix hostnames not being sent on alert [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) [15:56:36] (03CR) 10BCornwall: pybal: Fix hostnames not being sent on alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [15:57:00] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: enabled quantization of llms [deployment-charts] - 10https://gerrit.wikimedia.org/r/931965 (owner: 10Ilias Sarantopoulos) [15:57:02] (03CR) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [15:57:15] PROBLEM - Check systemd state on cloudbackup2001 is CRITICAL: CRITICAL - degraded: The following units failed: dm-event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:37] (03Abandoned) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [15:57:47] (03Merged) 10jenkins-bot: ml-services: enabled quantization of llms [deployment-charts] - 10https://gerrit.wikimedia.org/r/931965 (owner: 10Ilias Sarantopoulos) [15:58:17] (03PS1) 10Dzahn: microsites: delete transparency.wikimedia.org classes [puppet] - 
10https://gerrit.wikimedia.org/r/931966 (https://phabricator.wikimedia.org/T338781) [15:58:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:56] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster2001.codfw.wmnet [15:59:19] (03PS3) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [15:59:22] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41901/console" [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [16:00:16] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[16:01:30] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [16:02:16] (03CR) 10CI reject: [V: 04-1] mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [16:02:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:27] (03Abandoned) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [16:02:55] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1003.eqiad.wmnet [16:03:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [16:03:20] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:27] (03CR) 10Jelto: [C: 03+1] "lgtm, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931966 (https://phabricator.wikimedia.org/T338781) (owner: 10Dzahn) [16:03:43] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [16:03:46] (03PS1) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:04:15] (03CR) 10CI reject: [V: 04-1] openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [16:05:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:56] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:06:32] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: OpenSent - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:07:03] (03PS5) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [16:07:38] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:08:35] (03PS4) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [16:09:14] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:09:16] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:11] (03PS2) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:11:13] (03PS3) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [16:13:39] (03PS6) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 [16:13:55] (03PS3) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:13:57] (03PS4) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [16:14:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41904/console" [puppet] - 10https://gerrit.wikimedia.org/r/931945 (owner: 10Ssingh) [16:15:38] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2001.codfw.wmnet with OS bullseye [16:15:45] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... 
[16:19:31] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41906/console" [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [16:24:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:26:32] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:29:19] (03PS4) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:29:21] (03PS5) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [16:29:26] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:31:08] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:32:11] (03CR) 10Ottomata: "Going to merge and carefully apply in staging first." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [16:32:30] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:32:40] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:32:52] (03CR) 10Ottomata: [C: 03+2] eventgate-* - use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [16:34:00] (03Merged) 10jenkins-bot: eventgate-* - use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [16:36:53] 10SRE, 10Wikimedia-Etherpad, 10serviceops-collab: Upgrade etherpad.wikimedia.org to v1.9.0 - https://phabricator.wikimedia.org/T316421 (10LSobanski) p:05Low→03Medium [16:37:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:37:20] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:37:43] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [16:37:55] 10SRE, 10Wikimedia-Etherpad, 10serviceops-collab: Upgrade etherpad.wikimedia.org to v1.9.0 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release. This one has a larger list of changes (see below). What's Changed [Snyk] Upgrade rate-limit... 
[16:38:03] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1003.eqiad.wmnet [16:38:54] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:39:00] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:39:25] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [16:39:40] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:39:43] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:39:49] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:39:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1003.eqiad.wmnet [16:40:09] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:42:10] (03PS1) 10Ottomata: eventgate-logging-external/values-staging.yaml - update kafka brokers list [deployment-charts] - 10https://gerrit.wikimedia.org/r/931975 [16:42:24] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:42:29] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:42:58] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:43:02] !log eevans@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:43:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:34] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:44:12] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external/values-staging.yaml - update kafka brokers list [deployment-charts] - 10https://gerrit.wikimedia.org/r/931975 (owner: 10Ottomata) [16:45:32] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [16:45:45] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [16:47:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [16:47:35] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:47:37] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [16:49:20] (03PS1) 10Jgreen: Change incoming mail route for civicrm.wikimedia.org for new civicrm server. 
[puppet] - 10https://gerrit.wikimedia.org/r/931978 (https://phabricator.wikimedia.org/T329882) [16:54:50] (03PS5) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:55:24] (03CR) 10CI reject: [V: 04-1] openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [16:55:48] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:00] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:21] (03PS6) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [16:56:23] (03PS6) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [16:56:50] (03CR) 10Dwisehaupt: [C: 03+2] Change incoming mail route for civicrm.wikimedia.org for new civicrm server. 
[puppet] - 10https://gerrit.wikimedia.org/r/931978 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [16:57:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [16:58:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T1700) [17:01:50] !log sudo ipmitool -I lanplus -H "sessionstore2001.mgmt.codfw.wmnet" -U root -E chassis power reset [17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:13] (03CR) 10Volans: "I did a first pass, left some general comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [17:03:22] (03PS7) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [17:03:55] (03PS7) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [17:05:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:10] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC fails to build: https://puppet-compiler.wmflabs.org/output/931964/41909/" [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [17:06:48] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:47] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) [17:08:08] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:10:21] (03PS1) 10Ottomata: ~/otto/.bashrc - add an henv kube_env shortcut function [puppet] - 10https://gerrit.wikimedia.org/r/931982 [17:13:49] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [17:14:22] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [17:16:11] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:16:32] (03CR) 10Ottomata: [C: 03+2] ~/otto/.bashrc - add an henv kube_env shortcut function [puppet] - 10https://gerrit.wikimedia.org/r/931982 (owner: 10Ottomata) [17:16:44] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:17:20] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:17:36] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:18:25] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:18:41] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:20:13] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:20:26] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:21:10] !log sukhe@cumin2002 
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:21:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:21:13] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [17:21:14] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:21:21] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:22:02] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:22:35] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:22:42] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:23:11] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:23:26] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:30] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:23:52] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:24:46] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [17:25:04] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:26] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[17:25:54] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [17:27:12] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [17:27:54] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [17:28:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [17:28:44] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:29:12] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:30:04] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:30:46] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:30:49] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:32:17] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) (owner: 10Cwhite) [17:32:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:33:12] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:34:19] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [17:35:19] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [17:36:43] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:37:14] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:37:28] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [17:39:02] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:39:24] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:39:43] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [17:39:50] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:40:02] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:16] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:42:52] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['sessionstore2001.codfw.wmnet'] [17:43:28] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:44:11] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:44:17] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:44:38] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:30] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) Status update: networkpolicy for Kafka brokers has been DRY, but referencing the hostnames for Kafka brokers for... [17:46:46] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:51:22] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:52:58] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:18] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:34] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:56] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:59:48] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - 
https://phabricator.wikimedia.org/T337121 (10Jdlrobson) @ssingh I + @sgrabarczuk can sponsor this. What do you need? [17:59:53] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [17:59:55] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:00:02] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:00:06] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T1800) [18:00:25] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:01:45] (03CR) 10Dzahn: [C: 03+2] microsites: delete transparency.wikimedia.org classes [puppet] - 10https://gerrit.wikimedia.org/r/931966 (https://phabricator.wikimedia.org/T338781) (owner: 10Dzahn) [18:02:48] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10sgrabarczuk) I confirm, I may be a sponsor. [18:03:08] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:03:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [18:05:11] (03CR) 10Dzahn: [C: 03+2] "after running puppet on miscweb* a "apache2ctl -S" shows that the name virtual host is actually removed without a hard apache restart." 
[puppet] - 10https://gerrit.wikimedia.org/r/931966 (https://phabricator.wikimedia.org/T338781) (owner: 10Dzahn) [18:06:55] !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/TransparencyReport T338781 [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:00] T338781: move micro site transparency.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T338781 [18:08:20] !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/bienvenida T337047 [18:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:24] T337047: move micro site bienvenida.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T337047 [18:09:54] !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/annualreport T337041 [18:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:58] T337041: move micro site annual.wikimedia.org and 15.wikipedia.org to kubernetes - https://phabricator.wikimedia.org/T337041 [18:12:18] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:12:19] !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/racktables T327405 [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:24] T327405: Decommission Racktables - https://phabricator.wikimedia.org/T327405 [18:12:27] (03PS1) 10Jgreen: Switch payments-listener-eqiad to the new server/ip. [dns] - 10https://gerrit.wikimedia.org/r/931985 (https://phabricator.wikimedia.org/T319460) [18:12:42] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:13:13] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10ssingh) @Jdlrobson, @sgrabarczuk: just the comment here is enough, I will process this shortly. Thanks. 
[18:13:19] (03CR) 10CI reject: [V: 04-1] Switch payments-listener-eqiad to the new server/ip. [dns] - 10https://gerrit.wikimedia.org/r/931985 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [18:13:54] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:14:23] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [18:16:36] (03PS2) 10Jgreen: Switch payments-listener-eqiad to the new server/ip. [dns] - 10https://gerrit.wikimedia.org/r/931985 (https://phabricator.wikimedia.org/T319460) [18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:04] (03CR) 10Dwisehaupt: [C: 03+2] "Looks good. Let's do this." 
[dns] - 10https://gerrit.wikimedia.org/r/931985 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [18:24:41] !log sudo ipmitool -I lanplus -H "sessionstore2001.mgmt.codfw.wmnet" -U root -E mc reset cold [18:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:33] 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10Eevans) [18:27:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp1077.eqiad.wmnet,cp1079.eqiad.wmnet,cp1081.eqiad.wmnet,cp1083.eqiad.wmnet,cp1085.eqiad.wmnet,cp1087.eqiad.wmnet,cp1089.eqiad.wmnet} and A:cp [18:29:44] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp1078.eqiad.wmnet,cp1080.eqiad.wmnet,cp1082.eqiad.wmnet,cp1084.eqiad.wmnet,cp1086.eqiad.wmnet,cp1088.eqiad.wmnet,cp1090.eqiad.wmnet} and A:cp [18:32:19] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10ssingh) The iDRAC firmware in this case was 3.30, so in discussion with @Eevans, I upgraded it to 6.00, having done so many times for the other Traffic R440s. We then tried to run the cookbo... 
[18:37:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) [18:44:22] (03PS5) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [18:44:27] (03CR) 10Ladsgroup: mysql: Introduce sre.mysql.clone (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [18:46:57] (03PS4) 10Majavah: Added extended confirmed on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [18:47:14] (03CR) 10CI reject: [V: 04-1] mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: 10Ladsgroup) [18:48:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:48:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1025.eqiad.wmnet with OS bullseye [18:49:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed: - dbproxy1025 (... 
[18:50:26] (03PS6) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [18:52:30] (03PS1) 10Ssingh: admin: add eccenux to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/931989 (https://phabricator.wikimedia.org/T337121) [18:56:36] (03CR) 10Dzahn: [C: 03+1] "lgtm, matches email address I see in LDAP, ticket has sponsor comment" [puppet] - 10https://gerrit.wikimedia.org/r/931989 (https://phabricator.wikimedia.org/T337121) (owner: 10Ssingh) [18:57:08] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "per KFrancis: https://phabricator.wikimedia.org/T337121#8916613" [puppet] - 10https://gerrit.wikimedia.org/r/931989 (https://phabricator.wikimedia.org/T337121) (owner: 10Ssingh) [19:00:07] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [19:00:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) updated firmware. Started cpu stress test per dell request has ran for 24 hours with no errors on tsr report [19:00:40] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [19:01:37] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [19:01:55] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [19:02:28] (03CR) 10Ssingh: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931989 (https://phabricator.wikimedia.org/T337121) (owner: 10Ssingh) [19:02:30] (03CR) 10Ssingh: [C: 03+2] admin: add eccenux to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/931989 (https://phabricator.wikimedia.org/T337121) (owner: 10Ssingh) [19:02:34] (03CR) 10Dzahn: [C: 03+2] site: add buster people VMs to insetup role for decom [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [19:02:46] (03PS3) 10Dzahn: site: add buster people VMs to insetup role for decom [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) [19:03:41] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:03:49] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:03:52] (03Restored) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [19:03:58] (03PS4) 10Dzahn: site: add buster people VMs to insetup role for decom [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) [19:08:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:17] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10ssingh) @Jhancock.wm fixed the issue by running `racadm set IDRAC.WebServer.HostHeaderCheck 0` on the host. We can access the HTTP mgmt interface now. The cookbook however still fails with... [19:14:12] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10ssingh) Added to `nda`. 
Resolving this, please open if there are any issues. [19:14:17] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10ssingh) 05In progress→03Resolved [19:15:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:22] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10ssingh) I suspect a BIOS upgrade is also in order, though I am not going to attempt that without further input just yet. [19:22:31] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10Dzahn) ` 18:55 < JennH> oh this. papaul showed me a trick to fix this. can you reach it now? 18:55 < sukhe> let's try! 18:55 < sukhe> indeed! 18:55 < sukhe> thanks JennH, that worked! 18:55... [19:24:44] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:24:52] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:25:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:41] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:28:49] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:30:32] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [19:31:00] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10Jhancock.wm) @ssingh I might be mistaken, but I think that firmware update is just 
for 10G NIC cards [19:31:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:04] (03PS1) 10Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) [19:34:04] 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) Update: - Instead of trimming bottom 10 %, we are trimming bottom 5 % only. - We are plotting Box plots as we... [19:36:26] @seen kindrobot [19:37:39] (03CR) 10Jameel Kaisar: [C: 03+1] Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [19:37:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:51] !seen kindrobot [19:45:05] I keep forgetting the seen command but we definitely had it.. mwbot.? [19:45:12] wm-bot: help [19:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:15] mutante: i think seen command is enabled per channel [19:47:18] Zppix: was it wm-bot that was in a lot of channels and then told you "user was last seen quitting #wikimedia-... ", so you knew which of the channels they used [19:47:50] I know it said when last seen, i dont know what other info it provided [19:48:13] what was the syntax though :) [19:48:19] not @ or ! [19:48:26] @ [19:48:55] doesn't work on -tech either .. 
hmm [19:49:16] You may have to do @seen-on iirc (if you have the access to do so) [19:49:17] thanks Zppix, might have been disabled [19:51:15] mutante: this is the last entry I see grepping my log files: ./#wikimedia-operations/2023-06-17.log:82:[09:13:09] *** Quits: kindrobot [19:53:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:43] taavi: oh, thank you! and I found them . resolved :) [19:56:03] (03PS1) 10Kosta Harlan: Section images: Select placeholder when inserting it [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 (https://phabricator.wikimedia.org/T335209) [19:57:00] (03CR) 10Dzahn: "before merged I did a check of "files modified in /home in the last 7 days" because new VMs have uptime of 7 days. checked with one user s" [puppet] - 10https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [19:59:04] !log people.wikimedia.org - disabling shell access to people1003/people2002 (bullseye), use people1004/people2002 (bookworm) or people.eqiad.wmnet / people.codfw.wmnet in your configs if you have something automated or git repos - T338827 [19:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:09] T338827: upgrade people VMs to bookworm - https://phabricator.wikimedia.org/T338827 [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230621T2000). Please do the needful. [20:00:06] kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] hi, i'm here, and backporting it [20:00:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [20:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people1003.eqiad.wmnet [20:04:25] !log deleting VMs people1003.eqiad.wmnet and people2002.codfw.wmnet T338827 [20:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:29] T338827: upgrade people VMs to bookworm - https://phabricator.wikimedia.org/T338827 [20:08:36] (03PS3) 10Dzahn: gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) [20:08:56] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) Let me know if you come up with another name and I will create it for you. Also you can keep an eye on https://meta.wikimedia.org/wiki/Mailing_lists/Standardization for ideas. 
[20:09:24] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:11:25] (03CR) 10Dzahn: "root@gerrit1003:/# du -hs /home/" [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:11:34] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/931911 (https://phabricator.wikimedia.org/T333732) [20:11:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:42] (03PS4) 10Dzahn: gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) [20:13:31] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:13:33] (03Abandoned) 10Kosta Harlan: Section images: Select placeholder when inserting it [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [20:13:37] (03Restored) 10Kosta Harlan: Section images: Select placeholder when inserting it [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [20:13:50] (03CR) 10Dzahn: "as agreed in today's meeting, I will make a tarball of home on gerrit1001 and copy it to gerrit1003. 
then on the new prod servers /home wi" [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:14:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [20:14:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:14:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:14:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1003.eqiad.wmnet [20:15:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people2002.codfw.wmnet [20:16:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:32] (03PS5) 10Dzahn: gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) [20:20:34] (03PS1) 10Dzahn: site: remove decom'ed people.wikimedia.org backends [puppet] - 10https://gerrit.wikimedia.org/r/931999 (https://phabricator.wikimedia.org/T338827) [20:21:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:04] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/931680/41910/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:22:20] 
PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:16] (03CR) 10Dzahn: [C: 03+2] "no change needed besides that because "home" is already standard fileset that includes.. well.. /home" [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:27:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:34] !log gerrit1001 (formerly gerrit prod) - creating tarball of entire /home/ in /home/ and copying it over to gerrit1003 - simultaneously adding /home on gerrit servers to bacula from now on - T336427 [20:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:40] T336427: decom gerrit1001 - https://phabricator.wikimedia.org/T336427 [20:30:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:36] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:38:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:14] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:41:43] (03Merged) 10jenkins-bot: Section images: Select placeholder when inserting it [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931712 
(https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [20:42:11] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:931712|Section images: Select placeholder when inserting it (T335209)]] [20:42:15] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [20:43:41] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:931712|Section images: Select placeholder when inserting it (T335209)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:45:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:45:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:45:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people2002.codfw.wmnet [20:46:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:37] (03CR) 10Dzahn: "deleting bullseye VMs" [puppet] - 10https://gerrit.wikimedia.org/r/931999 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [20:47:59] (03CR) 10Dzahn: [C: 03+2] site: remove decom'ed people.wikimedia.org backends [puppet] - 10https://gerrit.wikimedia.org/r/931999 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [20:52:33] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:931712|Section images: Select placeholder when inserting it (T335209)]] (duration: 10m 21s) [20:52:37] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [20:52:58] !log UTC late deploys done [20:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:34] 
(KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:54:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:23] (03PS1) 10Jgreen: Switch civicrm.wm.o to the new server. [dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) [20:58:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:59:38] (03CR) 10Dwisehaupt: [C: 03+2] "looks right. shipit." 
[dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [21:07:15] (03CR) 10Dzahn: "seems like you don't have to worry about netbox yet - there is https://netbox.wikimedia.org/ipam/prefixes/44/ip-addresses/ but nothing to " [dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [21:12:05] (03PS1) 10Dzahn: phabricator: add /home to Bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/932007 [21:17:23] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:19] (03CR) 10Dzahn: "nothing to worry, size-wise it's tiny:" [puppet] - 10https://gerrit.wikimedia.org/r/932007 (owner: 10Dzahn) [21:19:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:19:13] (03CR) 10Dzahn: [C: 03+2] phabricator: add /home to Bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/932007 (owner: 10Dzahn) [21:23:10] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:23:42] (03PS6) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [21:23:55] !log eevans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [21:23:57] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [21:24:19] !log eevans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [21:26:17] (03CR) 10CI reject: [V: 04-1] Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 
10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [21:28:32] (03CR) 10Jgreen: Switch civicrm.wm.o to the new server. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [21:28:58] !log eevans@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [21:30:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:54] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [21:31:01] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [21:34:21] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10neriah) @Aklapper Seems like a good idea. can you send it there? [21:35:23] (03CR) 10Dzahn: Switch civicrm.wm.o to the new server. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [21:36:25] (03CR) 10Dzahn: [C: 03+1] vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [21:37:48] (03CR) 10Jgreen: Switch civicrm.wm.o to the new server. 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/932004 (https://phabricator.wikimedia.org/T329882) (owner: 10Jgreen) [21:39:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:07] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10Eevans) >>! In T340055#8953807, @Jhancock.wm wrote: > @ssingh I might be mistaken, but I think that firmware update is just for 10G NIC cards I think you're right (see: https://www.dell.com... [21:54:42] (03PS7) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [21:54:44] (03PS1) 10BCornwall: __init__: Fix SREBatchRunnerBase restart_daemon() [cookbooks] - 10https://gerrit.wikimedia.org/r/932010 [21:54:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:43] (03CR) 10CI reject: [V: 04-1] Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [22:00:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:02:35] PROBLEM - mailman list info on lists1001 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:03:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:03:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:07] (03PS1) 10JHathaway: dev env, ssh::client: create /etc/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) [22:05:04] (03PS1) 10JHathaway: dev env, ssh::server: create /run/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) [22:05:37] (03PS1) 10JHathaway: puppetserver: fix config perms [puppet] - 10https://gerrit.wikimedia.org/r/932015 (https://phabricator.wikimedia.org/T339913) [22:06:02] (03PS1) 10JHathaway: puppetserver::ca: add trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/932016 (https://phabricator.wikimedia.org/T339913) [22:07:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [22:07:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932015 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [22:07:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932016 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [22:07:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) 
(owner: 10JHathaway) [22:10:24] !log rsyncing data from cobalt.wikimedia.org (:p) from gerrit1001 to gerrit1003, /srv/gerrit/cobalt/ [22:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts gerrit1001.wikimedia.org [22:16:17] !log destroying previous production gerrit server gerrit1001 - T336427 [22:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:21] T336427: decom gerrit1001 - https://phabricator.wikimedia.org/T336427 [22:16:33] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 29 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:19:14] !log eevans@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2001.codfw.wmnet with OS bullseye [22:19:20] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... 
[22:22:51] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:25:04] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: gerrit1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [22:25:09] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 51 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:26:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: gerrit1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [22:26:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:26:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit1001.wikimedia.org [22:28:38] (03PS1) 10Dzahn: gerrit: remove gerrit1001 from site and gerrit2001 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/932021 (https://phabricator.wikimedia.org/T336427) [22:28:40] (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager.py: don't make incremental backups giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) [22:29:12] (03CR) 10CI reject: [V: 04-1] wmcs-cinder-backup-manager.py: don't make incremental backups giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) (owner: 10Andrew Bogott) [22:29:31] (03PS2) 10Andrew Bogott: wmcs-cinder-backup-manager.py: only full backups of giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) [22:29:59] (03CR) 10CI reject: [V: 04-1] wmcs-cinder-backup-manager.py: only full 
backups of giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) (owner: 10Andrew Bogott) [22:30:21] (03CR) 10Dzahn: gerrit: remove gerrit1001 from site and gerrit2001 hiera data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932021 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [22:30:27] (03PS3) 10Andrew Bogott: wmcs-cinder-backup-manager.py: only full backups of giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) [22:30:34] (03CR) 10Dzahn: [C: 03+2] gerrit: remove gerrit1001 from site and gerrit2001 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/932021 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [22:31:56] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup-manager.py: only full backups of giant volumes [puppet] - 10https://gerrit.wikimedia.org/r/932022 (https://phabricator.wikimedia.org/T339830) (owner: 10Andrew Bogott) [22:32:46] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [22:33:15] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:33:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:33:20] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) gerrit1001 has been destroyed today [22:33:22] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:33:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:33:31] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [22:35:32] !log jclark@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1023 - jclark@cumin1001" [22:36:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1023 - jclark@cumin1001" [22:36:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:38:05] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:38:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:38:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:38:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:38:30] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:38:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:39:09] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:39:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:39:38] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:39:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [22:40:37] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 35 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:45:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:07] 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netbox, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) [22:47:26] 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netbox, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) @Volans Could you delete / sync the special entry in netbox for former gerrit IP 208.80.154.137 (a... [22:48:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 61 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:48:27] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp1077.eqiad.wmnet,cp1079.eqiad.wmnet,cp1081.eqiad.wmnet,cp1083.eqiad.wmnet,cp1085.eqiad.wmnet,cp1087.eqiad.wmnet,cp1089.eqiad.wmnet} and A:cp [22:48:45] 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netbox, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) a:05Dzahn→03None [22:49:56] 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netbox, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) p:05Triage→03Medium [22:51:08] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [22:51:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:03] !log brett@cumin2002 END (PASS) - 
Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp1078.eqiad.wmnet,cp1080.eqiad.wmnet,cp1082.eqiad.wmnet,cp1084.eqiad.wmnet,cp1086.eqiad.wmnet,cp1088.eqiad.wmnet,cp1090.eqiad.wmnet} and A:cp [22:53:24] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1023 - jclark@cumin1001" [22:54:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1023 - jclark@cumin1001" [22:54:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:54:58] (03PS2) 10BCornwall: useless commit [cookbooks] - 10https://gerrit.wikimedia.org/r/932010 [22:55:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:55:14] (03Abandoned) 10BCornwall: useless commit [cookbooks] - 10https://gerrit.wikimedia.org/r/932010 (owner: 10BCornwall) [22:55:23] RECOVERY - WDQS SPARQL on wdqs2021 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.306 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:57:10] (03PS1) 10EoghanGaffney: releases: Add motd warning about upcoming host change [puppet] - 10https://gerrit.wikimedia.org/r/932026 [22:58:51] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [22:59:09] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 21 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:00:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:55] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [23:01:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [23:01:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [23:02:06] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1001.eqiad.wmnet [23:02:08] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:03:06] (03PS1) 10EoghanGaffney: releases: Add new releases hosts to docker_registry_ha allowlist [puppet] - 10https://gerrit.wikimedia.org/r/932027 [23:03:58] (03CR) 10Dzahn: [C: 03+1] releases: Add new releases hosts to docker_registry_ha allowlist [puppet] - 10https://gerrit.wikimedia.org/r/932027 (owner: 10EoghanGaffney) [23:04:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:56] (03CR) 10Dzahn: [C: 03+1] "I'd be fine just merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/932027 (owner: 10EoghanGaffney) [23:07:10] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [23:07:49] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 54 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:07:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [23:07:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:07:55] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1001.eqiad.wmnet on all recursors [23:07:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1001.eqiad.wmnet on all recursors [23:08:23] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [23:09:07] !log created temporary test VM phab-test1001.eqiad.wmnet which we need for a one-time test for T335080 - it will soon be destroyed again [23:09:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [23:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:11] T335080: phabricator->phorge migration - database handling - https://phabricator.wikimedia.org/T335080 [23:09:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host 
phab-test1001.eqiad.wmnet with OS buster [23:15:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [23:15:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [23:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [23:15:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [23:20:22] (03PS8) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [23:21:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:53] (03CR) 10CI reject: [V: 04-1] Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [23:23:29] (03PS1) 10Dzahn: site: add phab-test1001.eqiad.wmnet, temporarily [puppet] - 10https://gerrit.wikimedia.org/r/932028 (https://phabricator.wikimedia.org/T335080) [23:25:33] (03PS2) 10Dzahn: site: add phab-test1001.eqiad.wmnet, temporarily [puppet] - 10https://gerrit.wikimedia.org/r/932028 
(https://phabricator.wikimedia.org/T335080) [23:26:44] (03CR) 10Dzahn: [C: 03+2] site: add phab-test1001.eqiad.wmnet, temporarily [puppet] - 10https://gerrit.wikimedia.org/r/932028 (https://phabricator.wikimedia.org/T335080) (owner: 10Dzahn) [23:27:20] !log Move large translatable page (`mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki 'Movement Strategy and Governance/Movement Charter Ambassadors Program/grant' 'Movement Charter/Ambassadors Program/Grant' 'Martin Urbanec' --reason='restructuring of the Movement Charter's Meta infrastructure (per [[:phab:T338808|request]])'`; T338808) [23:27:21] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host phab-test1001.eqiad.wmnet with OS buster [23:27:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host phab-test1001.eqiad.wmnet [23:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:24] T338808: Request to move translatable page: Movement Strategy and Governance/Movement Charter Ambassadors Program/grant - https://phabricator.wikimedia.org/T338808 [23:30:49] !log Move a large translatable page (T339154) [23:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:53] T339154: Request to move translatable page: meta:Global Data and Insights Team - https://phabricator.wikimedia.org/T339154 [23:31:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:16] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1001.eqiad.wmnet [23:32:17] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:32:56] !log Move a large translatable page on foundationwiki (T338217) [23:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:00] T338217: Move translated page 
(IP_Information_tool_guidelines) on Foundation GovWiki - https://phabricator.wikimedia.org/T338217 [23:34:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [23:34:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [23:35:43] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [23:35:48] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host phab-test1001.eqiad.wmnet [23:36:01] (03CR) 10Tim Starling: Fix some mwscript bugs and clean up the style (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [23:36:28] (03PS6) 10Tim Starling: Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 [23:37:08] (03CR) 10Tim Starling: "PS6: trivial rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [23:37:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [23:37:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [23:38:15] (03CR) 10Tim Starling: [C: 03+2] Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [23:38:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:03] (03Merged) 10jenkins-bot: Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [23:40:45] (03PS1) 10Dzahn: site: use phab-test1002 instead of phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/932032 [23:40:57] (03CR) 10CI reject: [V: 04-1] site: use phab-test1002 instead of phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/932032 (owner: 10Dzahn) [23:41:13] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 29 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:41:34] (03PS2) 10Dzahn: site: use phab-test1002 instead of phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/932032 [23:41:41] (03CR) 10Dzahn: [C: 03+2] site: use phab-test1002 instead of phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/932032 (owner: 10Dzahn) [23:42:53] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1002.eqiad.wmnet [23:42:54] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:45:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:15] !log dzahn@cumin1001 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:46:47] !log tstarling@deploy1002 Synchronized multiversion: Fix some mwscript bugs and clean up the style (duration: 06m 31s) [23:47:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:47:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:47:00] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1002.eqiad.wmnet on all recursors [23:47:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1002.eqiad.wmnet on all recursors [23:47:06] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:49:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage [23:49:29] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:49:49] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 48 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:49:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM phab-test1002.eqiad.wmnet - 
dzahn@cumin1001" [23:50:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:50:12] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1002.eqiad.wmnet on all recursors [23:50:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1002.eqiad.wmnet on all recursors [23:50:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host phab-test1002.eqiad.wmnet [23:50:36] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1002.eqiad.wmnet [23:50:37] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:52:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage [23:52:45] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:53:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:53:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:53:29] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1002.eqiad.wmnet on all recursors [23:53:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1002.eqiad.wmnet on all recursors [23:53:35] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:54:43] (03PS1) 10BryanDavis: openstack: Hide locked Developer accounts from Keystone [puppet] - 10https://gerrit.wikimedia.org/r/932034 (https://phabricator.wikimedia.org/T339972) [23:55:50] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM phab-test1002.eqiad.wmnet - 
dzahn@cumin1001" [23:56:05] (03PS1) 10Dzahn: Revert "site: use phab-test1002 instead of phab-test1001" [puppet] - 10https://gerrit.wikimedia.org/r/931713 [23:56:22] (03PS1) 10Jcrespo: Revert "gerrit: backup /home on gerrit servers in Bacula" [puppet] - 10https://gerrit.wikimedia.org/r/931714 [23:56:24] (03CR) 10Dzahn: [C: 03+2] Revert "site: use phab-test1002 instead of phab-test1001" [puppet] - 10https://gerrit.wikimedia.org/r/931713 (owner: 10Dzahn) [23:56:29] (03CR) 10CI reject: [V: 04-1] Revert "site: use phab-test1002 instead of phab-test1001" [puppet] - 10https://gerrit.wikimedia.org/r/931713 (owner: 10Dzahn) [23:56:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM phab-test1002.eqiad.wmnet - dzahn@cumin1001" [23:56:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:56:34] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1002.eqiad.wmnet on all recursors [23:56:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1002.eqiad.wmnet on all recursors [23:56:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host phab-test1002.eqiad.wmnet [23:57:28] (03PS2) 10Dzahn: Revert "site: use phab-test1002 instead of phab-test1001" [puppet] - 10https://gerrit.wikimedia.org/r/931713 [23:58:27] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1001.eqiad.wmnet [23:58:28] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [23:59:42] (03CR) 10BryanDavis: "I tested the general idea in my Striker local dev environment with https://gerrit.wikimedia.org/r/c/labs/striker/+/932031/1/contrib/docker" [puppet] - 10https://gerrit.wikimedia.org/r/932034 (https://phabricator.wikimedia.org/T339972) (owner: 10BryanDavis)