[00:06:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [00:07:35] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:09:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [00:09:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:09:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2188.codfw.wmnet with OS bullseye [00:10:01] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2188.codfw.wmnet with OS bullseye completed: - db2188 (**WARN**) - Removed fro... [00:15:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:18:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:18:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2189.codfw.wmnet with OS bullseye [00:18:11] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2189.codfw.wmnet with OS bullseye completed: - db2189 (**WARN**) - Removed fro...
[00:20:07] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (Papaul) [00:24:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:25:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:25:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2191.codfw.wmnet with OS bullseye [00:25:53] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2191.codfw.wmnet with OS bullseye completed: - db2191 (**WARN**) - Removed fro... [00:26:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2192.mgmt.codfw.wmnet with reboot policy FORCED [00:27:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2193.mgmt.codfw.wmnet with reboot policy FORCED [00:33:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2190.codfw.wmnet with OS bullseye [00:33:55] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye executed with errors: - db2190 (**FAIL**) -...
[00:38:02] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (Papaul) [00:38:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2194.mgmt.codfw.wmnet with reboot policy FORCED [00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945021 [00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945021 (owner: TrainBranchBot) [00:39:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS bullseye [00:39:21] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye [00:39:26] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2190.codfw.wmnet with OS bullseye [00:39:32] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye executed with errors: - db2190 (**FAIL**) -...
[00:43:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2192.mgmt.codfw.wmnet with reboot policy FORCED [00:45:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2192'] [00:45:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2192'] [00:45:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2192'] [00:50:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2193.mgmt.codfw.wmnet with reboot policy FORCED [00:51:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2193'] [00:52:23] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (Papaul) [00:55:16] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (Papaul) @Jhancock.wm while installing the OS on db2190 for some reason i lost connection to the server. When i checked the switch port the link was showing down. when back on s... [00:55:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945021 (owner: TrainBranchBot) [00:57:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2192'] [01:00:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2194.mgmt.codfw.wmnet with reboot policy FORCED [01:14:50] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Nux) Changed hideSidebar upstream. You can update uk.wiki.
https://pl.wikipedia.org/wiki/Wikipedysta:Nux/hideSidebar.js That bro... [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:18:49] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:03] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2193'] [02:19:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2194'] [02:22:01] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:22:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2192.codfw.wmnet with OS bullseye [02:22:46] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2192.codfw.wmnet with OS bullseye [02:26:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2193.codfw.wmnet with OS bullseye [02:26:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db2193.codfw.wmnet with OS bullseye [02:27:07] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2193.codfw.wmnet with OS bullseye [02:27:31] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:27:37] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2193.codfw.wmnet with OS bullseye executed with errors: - db2193 (**FAIL**) -...
[02:30:43] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2194'] [02:30:57] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:16] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:32:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2194'] [02:32:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2194'] [02:33:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bullseye [02:34:00] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2194.codfw.wmnet with OS bullseye [02:43:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2192.codfw.wmnet with reason: host reimage [02:46:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on
db2192.codfw.wmnet with reason: host reimage [02:53:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [02:56:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [03:00:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:03:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:03:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2192.codfw.wmnet with OS bullseye [03:03:52] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2192.codfw.wmnet with OS bullseye completed: - db2192 (**WARN**) - Removed fro... [03:12:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:20:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:20:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bullseye [03:21:05] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2194.codfw.wmnet with OS bullseye completed: - db2194 (**WARN**) - Removed fro...
[03:22:51] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (Papaul) [03:56:04] (PS1) KartikMistry: Update cxserver to 2023-08-03-132800-production [deployment-charts] - https://gerrit.wikimedia.org/r/945697 (https://phabricator.wikimedia.org/T338602) [04:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:34:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:59:02] RECOVERY - BGP status on cr2-codfw is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:00:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 65, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:57] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:43:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:48:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:53:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [05:56:36] SRE, SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (Marostegui) We really need to come up with a way to be able to grant root access to clouddb* hosts that doesn't imply root on all the production databases, because that...
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230804T0600) [06:24:07] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) (owner: Filippo Giunchedi) [06:28:39] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:28:55] (PS1) Marostegui: site.pp: New hosts db12[34-49] [puppet] - https://gerrit.wikimedia.org/r/945700 (https://phabricator.wikimedia.org/T342166) [06:33:39] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:43:49] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (fgiunchedi) Thank you @Papaul !
[06:50:32] (CR) Filippo Giunchedi: [C: +2] admin: add maryana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) (owner: Filippo Giunchedi) [06:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:51:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (fgiunchedi) [06:52:26] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (fgiunchedi) Open→Resolved a:fgiunchedi @Maryana access will be live in the next 30 mins, I'm optimistically resolving the task -- p... [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230804T0700) [07:09:12] (CR) Marostegui: [C: +2] site.pp: New hosts db12[34-49] [puppet] - https://gerrit.wikimedia.org/r/945700 (https://phabricator.wikimedia.org/T342166) (owner: Marostegui) [07:16:07] SRE, ops-codfw, Infrastructure-Foundations, netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (ayounsi) [07:16:15] SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) [07:19:12] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1026.eqiad.wmnet with OS bullseye [07:19:56] (Abandoned) Ayounsi: Rename protocol icmpv6 to icmp6 [homer/public] - https://gerrit.wikimedia.org/r/945550 (owner: Ayounsi) [07:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:31:27] SRE, Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff) [07:31:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1026.eqiad.wmnet with reason: host reimage [07:34:50] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1026.eqiad.wmnet with reason: host reimage [07:37:00] !log installing Django security updates [07:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:43:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 139901 [07:45:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 139901 [07:45:52] (PS1) Ayounsi: Update wheels [software/homer/deploy] - https://gerrit.wikimedia.org/r/945748 (https://phabricator.wikimedia.org/T337082) [07:46:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 398203 [07:51:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:51:44] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 398203 [07:52:32] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:59:23] (CR) Muehlenhoff: [C: +2] ganeti: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944943 (owner: Muehlenhoff) [08:00:08] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) This got promoted to major. ` cr2-esams> show system alarms 2 alarms currently active Alarm time Class Description 2023-07-28 23:46:09 UTC Major FPC 0 Major Err...
[08:00:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1026.eqiad.wmnet with OS bullseye [08:01:03] ACKNOWLEDGEMENT - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms ayounsi https://phabricator.wikimedia.org/T318783 - The acknowledgement expires at: 2023-08-14 08:00:37. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:04:18] (PS1) Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945749 [08:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:26] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Thanks for the troubleshooting @brennen and @bd808 . I've done some tests changing oidc settings on the test instance, mo... [08:07:41] (PS1) Muehlenhoff: rabbitmq: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945750 [08:14:50] (PS1) Muehlenhoff: thanos: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945752 [08:15:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945749 (owner: Muehlenhoff) [08:16:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:34] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (MoritzMuehlenhoff) I think cn and uid are equally stable in practice: - Our current account handling doesn't allow to change eith...
[08:25:21] (PS2) Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945749 [08:25:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945749 (owner: Muehlenhoff) [08:26:52] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: Krinkle) [08:27:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945752 (owner: Muehlenhoff) [08:34:41] (CR) Muehlenhoff: "experimental" [puppet] - https://gerrit.wikimedia.org/r/945750 (owner: Muehlenhoff) [08:35:54] ops-eqiad: asw2-c-eqiad - faulty VC link - https://phabricator.wikimedia.org/T343507 (ayounsi) p:Triage→High [08:36:13] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (darthmon_wmde) hi! firs of all thanks a lot for the quick reaction to this! [08:36:58] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T343507 - The acknowledgement expires at: 2023-08-07 08:36:39. https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:37:26] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (darthmon_wmde) L3 signed on Jan 25 2021, 9:44 PM. {F37170145} [08:37:50] SRE, Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff) [08:38:24] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (darthmon_wmde) >>! In T342968#9063457, @fgiunchedi wrote: > @darthmon_wmde hello, you mentioned you'll be managing the wikibase releases, as such I take it you'll be added to `...
[08:43:17] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) Looking more into the alert and status, both ports on FPC0 PIC2 are down, one of which is the link to asw2-esams, so we have a loss of redundancy (traffic now only goes through c... [08:43:37] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (darthmon_wmde) [08:44:46] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (darthmon_wmde) please @RickiJay-WMDE add your ssh key to the description of this ticket and removed yourself as assignee afterwards. [08:56:49] (CR) Jaime Nuche: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/945749 (owner: Muehlenhoff) [08:58:08] (PS1) Muehlenhoff: Update MOU date for mhoutti [puppet] - https://gerrit.wikimedia.org/r/945754 [09:00:05] (CR) Muehlenhoff: [C: +2] Update MOU date for mhoutti [puppet] - https://gerrit.wikimedia.org/r/945754 (owner: Muehlenhoff) [09:14:33] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for ipmiseld [puppet] - https://gerrit.wikimedia.org/r/945755 (https://phabricator.wikimedia.org/T135991) [09:15:35] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945755 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:25:38] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (darthmon_wmde) > * Verification of your ssh keys out of band, the easiest would be to public the public key on your wiki user page done [09:29:55] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/gdnsd] - https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh) [09:30:18] (CR) Filippo Giunchedi: [C:
+1] thanos: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945752 (owner: Muehlenhoff) [09:36:45] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (fgiunchedi) >>! In T342968#9068565, @darthmon_wmde wrote: > L3 signed on Jan 25 2021, 9:44 PM. {F37170145} thank you! my bad for not checking via username and only first name!... [09:37:12] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (fgiunchedi) [09:41:31] sre-alert-triage, Data-Platform-SRE: 404 from nginx on wcqs2001 - https://phabricator.wikimedia.org/T342762 (Gehel) Open→Resolved a:Gehel [09:49:04] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/python-anycast-healthchecker] - https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh) [10:04:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet [10:15:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet [10:23:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:23:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:23:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:23:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50072 and previous config saved to /var/cache/conftool/dbconfig/20230804-102347-ladsgroup.json [10:23:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:24:00] !log ladsgroup@cumin1001 END (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:27:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [10:27:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [10:27:48] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:28:38] (03Abandoned) 10Hnowlan: trafficserver: add route for device-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/930216 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:28:41] (03Merged) 10jenkins-bot: rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:30:22] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10MoritzMuehlenhoff) [10:30:33] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:33:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3007.wikimedia.org [10:33:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:35:52] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10phuedx) [10:37:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002" [10:38:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - 
jmm@cumin2002" [10:38:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:38:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3007.wikimedia.org on all recursors [10:38:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast3007.wikimedia.org on all recursors [10:38:16] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:38:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" [10:39:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" [11:02:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast3007.wikimedia.org with OS bookworm [11:02:34] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm [11:03:26] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:04:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:06] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50073 and previous config saved to /var/cache/conftool/dbconfig/20230804-110705-ladsgroup.json [11:07:09] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:16] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P50074 and previous config saved to /var/cache/conftool/dbconfig/20230804-112212-ladsgroup.json [11:22:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:27:06] (03CR) 10David Caro: wmcs: enable isort and black (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [11:27:25] (03PS6) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [11:30:30] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning [11:30:43] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning [11:34:25] (03PS1) 10Muehlenhoff: Add new bastions to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/945767 (https://phabricator.wikimedia.org/T343121) [11:34:42] (03PS2) 10Muehlenhoff: Add new bastions to site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/945767 (https://phabricator.wikimedia.org/T343121) [11:36:45] (03PS3) 10Muehlenhoff: Add new bastions to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/945767 (https://phabricator.wikimedia.org/T343515) [11:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P50075 and previous config saved to /var/cache/conftool/dbconfig/20230804-113718-ladsgroup.json [11:38:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:38:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T342617)', diff saved to https://phabricator.wikimedia.org/P50076 and previous config saved to /var/cache/conftool/dbconfig/20230804-113848-ladsgroup.json [11:38:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:40:39] (03PS1) 10Jforrester: Commented user-defined validator function test that does nothing [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945662 [11:42:20] (03CR) 10Muehlenhoff: [C: 03+2] Add new bastions to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/945767 (https://phabricator.wikimedia.org/T343515) (owner: 10Muehlenhoff) [11:42:45] (03PS8) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [11:43:12] (03PS6) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [11:44:11] (03CR) 10FNegri: [C: 03+1] "LGTM! I wonder if we should move all those files to a dedicated repo, and let Puppet pull from that repo? 
Not something that must be addre" [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [11:48:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [11:51:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [11:52:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50077 and previous config saved to /var/cache/conftool/dbconfig/20230804-115224-ladsgroup.json [11:52:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:52:29] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:52:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:57:36] (03PS1) 10Jforrester: AboutEditMetadataDialog: Don't clear the edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945663 (https://phabricator.wikimedia.org/T343380) [11:59:18] (03PS1) 10Jforrester: Remove 'wikilambda-edit' as default right; re-label to make clear [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945664 (https://phabricator.wikimedia.org/T343400) [12:00:10] (03CR) 10Jforrester: [C: 03+2] "Fixing tests so we can land stuff." 
[extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945662 (owner: 10Jforrester) [12:00:16] (03CR) 10Jforrester: [C: 03+2] AboutEditMetadataDialog: Don't clear the edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945663 (https://phabricator.wikimedia.org/T343380) (owner: 10Jforrester) [12:00:22] (03CR) 10Jforrester: [C: 03+2] Remove 'wikilambda-edit' as default right; re-label to make clear [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945664 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [12:00:48] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Use orchestrator image that double-checks validation state too [deployment-charts] - 10https://gerrit.wikimedia.org/r/945684 (owner: 10Jforrester) [12:01:37] (03Merged) 10jenkins-bot: Wikifunctions: Use orchestrator image that double-checks validation state too [deployment-charts] - 10https://gerrit.wikimedia.org/r/945684 (owner: 10Jforrester) [12:03:53] (03Merged) 10jenkins-bot: Commented user-defined validator function test that does nothing [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945662 (owner: 10Jforrester) [12:03:59] (03Merged) 10jenkins-bot: AboutEditMetadataDialog: Don't clear the edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945663 (https://phabricator.wikimedia.org/T343380) (owner: 10Jforrester) [12:04:02] (03Merged) 10jenkins-bot: Remove 'wikilambda-edit' as default right; re-label to make clear [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945664 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [12:04:38] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [12:05:21] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [12:06:50] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast3007.wikimedia.org with OS bookworm [12:06:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3007.wikimedia.org [12:07:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast3007.wikimedia.org with OS bookworm completed: - bast3007 (**WARN**) - Removed... [12:11:09] (03CR) 10David Caro: wmcs: enable isort and black (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [12:13:21] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [12:14:03] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10fnegri) [12:14:11] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [12:14:32] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [12:14:44] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [12:16:19] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10fnegri) 05Open→03Resolved a:03fnegri Two hosts have been created (cloudcumin1001.eqiad.wmnet and cloudcumin2001... 
[12:16:28] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [12:17:05] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (10fnegri) [12:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T342617)', diff saved to https://phabricator.wikimedia.org/P50079 and previous config saved to /var/cache/conftool/dbconfig/20230804-122042-ladsgroup.json [12:20:46] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:21:01] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758 (10fnegri) [12:25:17] !log jforrester@deploy1002 Synchronized php-1.41.0-wmf.20/extensions/WikiLambda: T343380 and T343400 (duration: 10m 12s) [12:25:21] T343380: Label editor throws away input when selecting language - https://phabricator.wikimedia.org/T343380 [12:25:22] T343400: Allow editing of "about" with less rights - https://phabricator.wikimedia.org/T343400 [12:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:26] !log bounce prometheus@k8s on prometheus100[56] to test failure to reload certs [12:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) 05In progress→03Resolved Logs are now working correctly, though the fact they are going through... [12:34:05] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [12:34:16] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (10fnegri) [12:34:45] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [12:35:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org [12:35:43] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) [12:35:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P50080 and previous config saved to /var/cache/conftool/dbconfig/20230804-123548-ladsgroup.json [12:36:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:06] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [12:37:51] PROBLEM - Host dbproxy1018 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org [12:40:03] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [12:40:19] 10SRE, 10ops-eqiad: asw2-c-eqiad - faulty VC link - https://phabricator.wikimedia.org/T343507 (10Jclark-ctr) @ayounsi Fixed down connection. port shows link. will close ticket after error clears [12:41:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast3007.wikimedia.org [12:42:30] 10SRE, 10ops-eqiad: asw2-c-eqiad - faulty VC link - https://phabricator.wikimedia.org/T343507 (10Jclark-ctr) 05Open→03Resolved [12:49:00] (03PS1) 10Elukey: admin_ng: raise knative-serving's pod limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/945772 [12:50:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P50081 and previous config saved to /var/cache/conftool/dbconfig/20230804-125055-ladsgroup.json [12:52:20] (03CR) 10Elukey: [C: 03+2] admin_ng: raise knative-serving's pod limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/945772 (owner: 10Elukey) [12:53:59] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) Cleaned fiber replaced optic [12:55:44] (ThanosQueryInstantLatencyHigh) firing: Thanos 
Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:57:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:58:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:58:33] (03PS1) 10Volans: sre.hosts.reimage: temporary skip config-masters [cookbooks] - 10https://gerrit.wikimedia.org/r/945774 [12:59:07] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.reimage: temporary skip config-masters [cookbooks] - 10https://gerrit.wikimedia.org/r/945774 (owner: 10Volans) [12:59:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:59:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:00:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast3007.wikimedia.org [13:01:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [13:01:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast4005.wikimedia.org [13:01:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:01:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [13:01:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50082 and previous config saved to /var/cache/conftool/dbconfig/20230804-130142-ladsgroup.json [13:01:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[13:01:51] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:02:05] (03CR) 10Clément Goubert: [C: 03+2] sre.hosts.reimage: temporary skip config-masters [cookbooks] - 10https://gerrit.wikimedia.org/r/945774 (owner: 10Volans) [13:02:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:04:43] (03Merged) 10jenkins-bot: sre.hosts.reimage: temporary skip config-masters [cookbooks] - 10https://gerrit.wikimedia.org/r/945774 (owner: 10Volans) [13:05:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:06:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T342617)', diff saved to https://phabricator.wikimedia.org/P50083 and previous config saved to /var/cache/conftool/dbconfig/20230804-130601-ladsgroup.json [13:06:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:06:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T342617)', diff saved to https://phabricator.wikimedia.org/P50084 and previous config saved to /var/cache/conftool/dbconfig/20230804-130622-ladsgroup.json [13:08:46] (03PS1) 10Muehlenhoff: mediawiki: Remove Ferm-specific syntax from firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/945776 [13:09:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4005.wikimedia.org - jmm@cumin2002" [13:12:36] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4005.wikimedia.org - jmm@cumin2002" [13:12:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:36] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast4005.wikimedia.org on all recursors [13:12:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast4005.wikimedia.org on all recursors [13:12:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4005.wikimedia.org - jmm@cumin2002" [13:13:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4005.wikimedia.org - jmm@cumin2002" [13:13:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945776 (owner: 10Muehlenhoff) [13:14:38] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [13:20:01] (03PS1) 10Elukey: custom_deploy.d: increase ingress gw's resources for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/945778 [13:20:45] (03PS1) 10Muehlenhoff: zookeeper: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945779 [13:21:42] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: increase ingress gw's resources for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/945778 (owner: 10Elukey) [13:22:45] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: Remove Ferm-specific syntax from firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/945776 (owner: 10Muehlenhoff) [13:25:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 
(10Jhancock.wm) @Papaul db2190 and db2193 have active links now. [13:25:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945779 (owner: 10Muehlenhoff) [13:30:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2195.mgmt.codfw.wmnet with reboot policy FORCED [13:33:19] (03CR) 10Muehlenhoff: [C: 03+2] "Merging." [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: 10Krinkle) [13:34:27] (03PS1) 10AOkoth: vrts: send /var/log/{clamav,freshclam}.log to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/945781 [13:36:22] (03CR) 10Krinkle: "I'm guessing this requires private repo to be updated first to move the credentials to match the new role." [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [13:36:33] (03CR) 10Krinkle: "(per PCC failure)" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [13:37:43] (03PS1) 10Muehlenhoff: webperf: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945782 [13:39:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4005.wikimedia.org with OS bookworm [13:39:38] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4005.wikimedia.org with OS bookworm [13:44:23] (03PS1) 10Giuseppe Lavagetto: puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) [13:44:47] (03CR) 10CI reject: [V: 04-1] puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [13:45:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2193.codfw.wmnet with OS bullseye [13:45:58] 10SRE, 10ops-codfw, 
10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2193.codfw.wmnet with OS bullseye [13:46:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS bullseye [13:46:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye [13:46:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2190.codfw.wmnet with OS bullseye [13:46:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye executed with errors: - db2190 (**FAIL**) -... [13:48:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS bullseye [13:48:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye [13:49:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) @Jhancock.wm thank you. 
[13:50:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2195.mgmt.codfw.wmnet with reboot policy FORCED [13:50:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2195'] [13:54:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945782 (owner: 10Muehlenhoff) [13:55:16] (03PS1) 10Hnowlan: rest-gateway: add availability route [deployment-charts] - 10https://gerrit.wikimedia.org/r/945784 (https://phabricator.wikimedia.org/T339119) [13:57:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4005.wikimedia.org with reason: host reimage [13:57:45] (03PS1) 10Muehlenhoff: ceph::server: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945785 [14:01:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2195'] [14:02:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4005.wikimedia.org with reason: host reimage [14:02:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945782 (owner: 10Muehlenhoff) [14:03:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945785 (owner: 10Muehlenhoff) [14:05:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2195.codfw.wmnet with OS bullseye [14:05:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2195.codfw.wmnet with OS bullseye [14:05:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:38] (03PS1) 10Hnowlan: rest-gateway: fix pathing for knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) [14:07:12] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:07:28] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:08:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [14:08:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [14:11:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:03] (03PS1) 10Jforrester: Wikifunctions: Add oathauth-enable to wikifunctions-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945808 (https://phabricator.wikimedia.org/T342868) [14:15:13] (03PS1) 10Jforrester: Wikifunctions: Tell WikiLambda to stash results in our bespoke cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:35] 
(03PS1) 10Andrew Bogott: Move traffic off of dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/945810 [14:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4005.wikimedia.org with OS bookworm [14:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast4005.wikimedia.org [14:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50085 and previous config saved to /var/cache/conftool/dbconfig/20230804-141713-ladsgroup.json [14:17:18] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast4005.wikimedia.org with OS bookworm completed: - bast4005 (**PASS**) - Removed from Puppet and Puppet... [14:17:18] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:17:51] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:18:06] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:19:08] (03CR) 10Andrew Bogott: [C: 03+2] Move traffic off of dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/945810 (owner: 10Andrew Bogott) [14:20:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:20:25] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:22:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast4005.wikimedia.org [14:23:33] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync Hiera after adding bast4005 - jmm@cumin2002" [14:23:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:24:32] (03PS1) 10Andrew Bogott: Revert "Move traffic off of dbproxy1018" [puppet] - 10https://gerrit.wikimedia.org/r/945790 [14:25:09] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move traffic off of dbproxy1018" [puppet] - 10https://gerrit.wikimedia.org/r/945790 (owner: 10Andrew Bogott) [14:25:38] (03PS1) 10Muehlenhoff: Uncomment sysctl-userns alias [puppet] - 10https://gerrit.wikimedia.org/r/945812 [14:25:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:25:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2193.codfw.wmnet with OS bullseye [14:25:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [14:25:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2193.codfw.wmnet with OS bullseye completed: - db2193 (**PASS**) - Removed fro... 
[14:25:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync Hiera after adding bast4005 - jmm@cumin2002" [14:26:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast4005.wikimedia.org [14:26:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:27:26] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:27:41] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:27:57] Hey all - I’d like to deploy a quick updated security mitigation for T336027. I know it’s Friday, but this is a low-risk change to PS.php that will help combat an LTA going into the weekend. [14:28:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T342617)', diff saved to https://phabricator.wikimedia.org/P50086 and previous config saved to /var/cache/conftool/dbconfig/20230804-142851-ladsgroup.json [14:28:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:28:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [14:29:39] sbassett: o/ seems fine, please announce it to #wikimedia-sre so on-call folks awill be aware [14:29:50] *will be [14:30:03] (03PS2) 10Hnowlan: rest-gateway: fix pathing for knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) [14:31:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:31:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
db2190.codfw.wmnet with OS bullseye [14:31:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye completed: - db2190 (**PASS**) - Removed fro... [14:32:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P50087 and previous config saved to /var/cache/conftool/dbconfig/20230804-143219-ladsgroup.json [14:32:34] 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10serviceops: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Jclark-ctr) [14:40:54] !log Deployed updated mitigation for T336027 [14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P50088 and previous config saved to /var/cache/conftool/dbconfig/20230804-144357-ladsgroup.json [14:44:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:46:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1146:3312', diff saved to https://phabricator.wikimedia.org/P50089 and previous config saved to /var/cache/conftool/dbconfig/20230804-144726-ladsgroup.json [14:50:26] RECOVERY - Host dbproxy1018 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:50:56] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy [14:51:34] PROBLEM - Check systemd state on dbproxy1018 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:14] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10serviceops: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Papaul) ` {master:2}[edit] papaul@asw2-c-eqiad# run show interfaces ge-5/0/6 descriptions Interface Admin Link Description ge-5/0/6 up up db... [14:53:11] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10serviceops: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Papaul) 05Open→03Resolved This is complete [14:54:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:54:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2195.codfw.wmnet with OS bullseye [14:54:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2195.codfw.wmnet with OS bullseye completed: - db2195 (**PASS**) - Removed fro... 
[14:56:42] (03PS2) 10Giuseppe Lavagetto: puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) [14:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P50090 and previous config saved to /var/cache/conftool/dbconfig/20230804-145903-ladsgroup.json [14:59:59] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs2001.codfw.wmnet with OS bullseye [15:00:14] RECOVERY - Check systemd state on dbproxy1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50091 and previous config saved to /var/cache/conftool/dbconfig/20230804-150232-ladsgroup.json [15:02:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:02:36] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:02:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:02:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:03:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:03:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T342617)', diff saved to https://phabricator.wikimedia.org/P50092 and previous config saved to /var/cache/conftool/dbconfig/20230804-150310-ladsgroup.json [15:03:25] (03CR) 
10Elukey: [C: 03+1] zookeeper: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945779 (owner: 10Muehlenhoff) [15:05:18] (03PS3) 10Giuseppe Lavagetto: puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) [15:07:38] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:06] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T342617)', diff saved to https://phabricator.wikimedia.org/P50093 and previous config saved to /var/cache/conftool/dbconfig/20230804-151409-ladsgroup.json [15:14:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:14:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:14:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:14:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:14:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:14:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T342617)', diff saved to https://phabricator.wikimedia.org/P50094 and previous config saved to /var/cache/conftool/dbconfig/20230804-151435-ladsgroup.json [15:16:09] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wcqs2001.codfw.wmnet with reason: host reimage [15:18:55] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs2001.codfw.wmnet with reason: host reimage [15:22:34] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:31:28] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:30] (03PS4) 10Giuseppe Lavagetto: puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) [15:49:32] (03PS1) 10Giuseppe Lavagetto: profile::cache::base: add netmapper file for proxies [puppet] - 10https://gerrit.wikimedia.org/r/945818 (https://phabricator.wikimedia.org/T343294) [15:49:35] (03PS1) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [15:52:10] _joe_: Can you process https://gerrit.wikimedia.org/r/c/operations/puppet/+/945621 today? 
Puppet is broken on deploy-1004.devtools.eqiad1.wikimedia.cloud in the meantime. [15:52:38] <_joe_> dancy: apologies, it's been a hard day :) I'll try to take a look [15:52:45] Thanks and hugs! [15:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T342617)', diff saved to https://phabricator.wikimedia.org/P50095 and previous config saved to /var/cache/conftool/dbconfig/20230804-155816-ladsgroup.json [15:58:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:00:39] (03CR) 10RLazarus: [C: 03+1] "LGTM, I'd have some Python nits but they aren't important. I assume you've run this through some test data and it does the right thing wit" [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [16:01:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [16:02:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) 05Open→03Resolved @Marostegui all your's. 
Have fun [16:04:33] <_joe_> dancy: i'll merge shortly [16:04:39] thx [16:06:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [16:09:44] (03PS1) 10Giuseppe Lavagetto: Add stub for hieradata for puppetmaster::frontend [labs/private] - 10https://gerrit.wikimedia.org/r/945821 [16:10:03] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add stub for hieradata for puppetmaster::frontend [labs/private] - 10https://gerrit.wikimedia.org/r/945821 (owner: 10Giuseppe Lavagetto) [16:11:50] (03CR) 10Giuseppe Lavagetto: puppetmaster::frontend: fetch ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [16:11:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] puppetmaster::frontend: fetch ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/945783 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [16:12:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T342617)', diff saved to https://phabricator.wikimedia.org/P50096 and previous config saved to /var/cache/conftool/dbconfig/20230804-161212-ladsgroup.json [16:12:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:13:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P50097 and previous config saved to /var/cache/conftool/dbconfig/20230804-161322-ladsgroup.json [16:23:59] (03PS1) 10Giuseppe Lavagetto: ip_reputation_vendors: remove conditional define [puppet] - 10https://gerrit.wikimedia.org/r/945822 [16:24:24] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_ip_reputation.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ip_reputation_vendors: remove conditional define [puppet] - 10https://gerrit.wikimedia.org/r/945822 (owner: 10Giuseppe Lavagetto) [16:27:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P50098 and previous config saved to /var/cache/conftool/dbconfig/20230804-162719-ladsgroup.json [16:27:30] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P50099 and previous config saved to /var/cache/conftool/dbconfig/20230804-162829-ladsgroup.json [16:34:56] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_ip_reputation.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:16] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_ip_reputation.service,upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P50100 and previous config saved to /var/cache/conftool/dbconfig/20230804-164225-ladsgroup.json [16:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T342617)', diff saved to https://phabricator.wikimedia.org/P50101 and previous config saved to /var/cache/conftool/dbconfig/20230804-164335-ladsgroup.json [16:43:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:43:39] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:43:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50102 and previous config saved to /var/cache/conftool/dbconfig/20230804-164356-ladsgroup.json [16:52:18] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T342617)', diff saved to https://phabricator.wikimedia.org/P50103 and previous config saved to /var/cache/conftool/dbconfig/20230804-165731-ladsgroup.json [16:57:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:57:37] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:57:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet 
with reason: Maintenance [16:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50104 and previous config saved to /var/cache/conftool/dbconfig/20230804-165753-ladsgroup.json [16:58:49] (03PS1) 10Giuseppe Lavagetto: ip_reputation_vendors: temoprarily disable the timer [puppet] - 10https://gerrit.wikimedia.org/r/945848 [16:59:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Specifically, the puppetmasters have a 136 GB physical volume that's mostly unused." [puppet] - 10https://gerrit.wikimedia.org/r/945848 (owner: 10Giuseppe Lavagetto) [17:23:12] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:44] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs2001.codfw.wmnet with OS bullseye [17:27:27] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:35:33] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:08] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Papaul) @Jhancock.wm thank you for taking time to work on this node. Since you did all that supposed to be done to troubleshoot this issue and the server is still not booting up, our next step will be to swap the main boa... 
[17:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50105 and previous config saved to /var/cache/conftool/dbconfig/20230804-175348-ladsgroup.json [17:53:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:05:44] (03CR) 10Krinkle: Added extended confirmed on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [18:08:28] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:08:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P50106 and previous config saved to /var/cache/conftool/dbconfig/20230804-180854-ladsgroup.json [18:09:18] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:24] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:09:46] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:13:03] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:13:12] (SystemdUnitFailed) firing: (4) 
prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:36] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921 [18:13:39] T323921: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 [18:14:00] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921 [18:15:10] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs2002.codfw.wmnet with OS bullseye [18:16:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50107 and previous config saved to /var/cache/conftool/dbconfig/20230804-181612-ladsgroup.json [18:16:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:21:17] (03PS5) 10Krinkle: Added extended confirmed on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [18:22:32] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P50108 and previous config saved to /var/cache/conftool/dbconfig/20230804-182400-ladsgroup.json [18:31:18] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2002.codfw.wmnet 
with reason: host reimage [18:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P50109 and previous config saved to /var/cache/conftool/dbconfig/20230804-183118-ladsgroup.json [18:34:22] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs2002.codfw.wmnet with reason: host reimage [18:34:25] (03PS1) 10Mforns: Bump up mediawiki_history_snapshot to 2023-07 [puppet] - 10https://gerrit.wikimedia.org/r/945852 [18:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50110 and previous config saved to /var/cache/conftool/dbconfig/20230804-183906-ladsgroup.json [18:39:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:39:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:39:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:39:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50111 and previous config saved to /var/cache/conftool/dbconfig/20230804-183927-ladsgroup.json [18:46:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P50112 and previous config saved to /var/cache/conftool/dbconfig/20230804-184625-ladsgroup.json [19:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50113 and previous config saved to /var/cache/conftool/dbconfig/20230804-190131-ladsgroup.json [19:01:33] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[19:01:39] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[19:01:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[19:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50114 and previous config saved to /var/cache/conftool/dbconfig/20230804-190152-ladsgroup.json
[19:09:17] hi operations, is there any SRE that can help me review and merge a puppet change? https://gerrit.wikimedia.org/r/c/operations/puppet/+/945852 It's just the release of a dataset for the AQS, It's explained here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[19:09:45] It also needs a rolling restart of the AQS servers, also explained in the pasted docs.
[19:11:52] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs2002.codfw.wmnet with OS bullseye
[19:12:04] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921
[19:12:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921
[19:12:12] T323921: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921
[19:15:00] rzl: hi! could you please help me :] It's the thing I described above. I think it shouldn't take more than 10 minutes?
[19:17:21] mforns: hey, typically we'd let the SREs in data engineering handle that, are they available to help? if anything with the AQS restart didn't go normally, that's who I'd want to have nearby
[19:17:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:17:48] hi rzl! No, that's why I'm asking here...
[19:18:11] okay, is it something that urgently needs to happen before the weekend?
[19:18:54] It's the release of a monthly dataset that populates AQS and Wikistats (in this case for the month of july)
[19:19:15] People are expecting it, but I don't think it's critical
[19:20:05] If you prefer, we can wait till Monday, and ask one of data eng SREs.
[19:21:29] that's my first choice, yeah -- but if something changes, and this needs to go out right away, happy to revisit
[19:22:12] ok, don't worry, I will let the team know, and I don't think people will be against waiting. I kinda also think it's better :] Thanks!
[19:22:17] sorry to make you wait :) in the normal case this would be no big deal, but if there's a problem I don't want to have to debug it late on a Friday with nobody else around
[19:22:25] of course!
[19:37:06] (PS1) Jforrester: ApiFunctionCall: Check calls for Z16K2 and deny those too [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/945791
[19:40:35] (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/945791 (owner: Jforrester)
[19:44:06] (Merged) jenkins-bot: ApiFunctionCall: Check calls for Z16K2 and deny those too [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/945791 (owner: Jforrester)
[19:44:21] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:945791|ApiFunctionCall: Check calls for Z16K2 and deny those too]]
[19:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50115 and previous config saved to /var/cache/conftool/dbconfig/20230804-194811-ladsgroup.json
[19:48:19] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[19:58:54] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:945791|ApiFunctionCall: Check calls for Z16K2 and deny those too]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:02:54] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:03:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P50116 and previous config saved to /var/cache/conftool/dbconfig/20230804-200317-ladsgroup.json
[20:03:22] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs2002 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:03:48] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:22] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:04:35] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921
[20:04:38] T323921: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921
[20:04:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921
[20:08:47] !log jforrester@deploy1002 jforrester: Continuing with sync
[20:11:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50118 and previous config saved to /var/cache/conftool/dbconfig/20230804-201107-ladsgroup.json
[20:11:11] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[20:11:15] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808) >>! In T320390#9068533, @MoritzMuehlenhoff wrote: > I think cn and uid are equally stable in practice: > - Our current acc...
[20:15:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:18:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P50119 and previous config saved to /var/cache/conftool/dbconfig/20230804-201824-ladsgroup.json
[20:18:25] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:945791|ApiFunctionCall: Check calls for Z16K2 and deny those too]] (duration: 34m 04s)
[20:20:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:21:22] !log imported libvmod-querysort package in bookworm-wikimedia (T342154)
[20:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:25] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[20:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P50120 and previous config saved to /var/cache/conftool/dbconfig/20230804-202613-ladsgroup.json
[20:33:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50121 and previous config saved to /var/cache/conftool/dbconfig/20230804-203330-ladsgroup.json
[20:33:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[20:33:36] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[20:33:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[20:33:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50122 and previous config saved to /var/cache/conftool/dbconfig/20230804-203351-ladsgroup.json
[20:36:10] SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Tsevener)
[20:41:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P50123 and previous config saved to /var/cache/conftool/dbconfig/20230804-204120-ladsgroup.json
[20:56:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50124 and previous config saved to /var/cache/conftool/dbconfig/20230804-205626-ladsgroup.json
[20:56:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[20:56:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[20:56:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[20:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T342617)', diff saved to https://phabricator.wikimedia.org/P50125 and previous config saved to /var/cache/conftool/dbconfig/20230804-205647-ladsgroup.json
[21:14:52] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:15:36] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:36] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs2001 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:15:58] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124
[21:16:13] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 15s)
[21:16:50] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124
[21:16:59] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 09s)
[21:19:45] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124
[21:20:30] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 44s)
[21:24:03] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:33:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T342617)', diff saved to https://phabricator.wikimedia.org/P50126 and previous config saved to /var/cache/conftool/dbconfig/20230804-213336-ladsgroup.json
[21:33:40] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[21:34:45] (PS1) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254)
[21:34:52] (PS4) AOkoth: vrts: add test VM to site [puppet] - https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027)
[21:34:54] (PS1) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[21:34:58] (PS2) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254)
[21:35:17] (PS2) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[21:35:25] (CR) CI reject: [V: -1] prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[21:42:50] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50127 and previous config saved to /var/cache/conftool/dbconfig/20230804-214326-ladsgroup.json
[21:43:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[21:48:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P50128 and previous config saved to /var/cache/conftool/dbconfig/20230804-214842-ladsgroup.json
[21:51:00] SRE, Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (matmarex) I also ran into this today. In addition to the files in `modules/mailman3/files/templates/`, there's also the file `modules/prof...
[21:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P50129 and previous config saved to /var/cache/conftool/dbconfig/20230804-215832-ladsgroup.json
[22:03:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P50130 and previous config saved to /var/cache/conftool/dbconfig/20230804-220348-ladsgroup.json
[22:13:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P50131 and previous config saved to /var/cache/conftool/dbconfig/20230804-221338-ladsgroup.json
[22:18:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T342617)', diff saved to https://phabricator.wikimedia.org/P50132 and previous config saved to /var/cache/conftool/dbconfig/20230804-221855-ladsgroup.json
[22:18:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[22:18:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:19:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[22:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T342617)', diff saved to https://phabricator.wikimedia.org/P50133 and previous config saved to /var/cache/conftool/dbconfig/20230804-221915-ladsgroup.json
[22:28:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T342617)', diff saved to https://phabricator.wikimedia.org/P50134 and previous config saved to /var/cache/conftool/dbconfig/20230804-222845-ladsgroup.json
[22:28:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[22:28:49] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:29:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[22:29:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T342617)', diff saved to https://phabricator.wikimedia.org/P50135 and previous config saved to /var/cache/conftool/dbconfig/20230804-222905-ladsgroup.json
[22:32:23] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124
[22:33:17] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 54s)
[22:33:24] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs2002 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:33:54] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:35:52] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:50:29] SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (Kappakayala)
[22:53:20] SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (Kappakayala) untagging serviceops as there seems to be no action needed from us. Please feel free to re-tag if anything needed from us. Thanks!
[22:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T342617)', diff saved to https://phabricator.wikimedia.org/P50136 and previous config saved to /var/cache/conftool/dbconfig/20230804-225542-ladsgroup.json
[22:55:48] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:00:07] !log removing 1 file for legal compliance
[23:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:54] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-lexemes-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P50137 and previous config saved to /var/cache/conftool/dbconfig/20230804-231048-ladsgroup.json
[23:22:17] SRE, ops-eqiad, sre-alert-triage, DC-Ops: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (Kappakayala)
[23:23:33] SRE, ops-eqiad, sre-alert-triage, DC-Ops: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (Kappakayala) Removing serviceops tag as I believe there is no action required from serviceops team. Please feel free to re-tag and comment if there is anything need from servi...
[23:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P50138 and previous config saved to /var/cache/conftool/dbconfig/20230804-232555-ladsgroup.json
[23:41:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T342617)', diff saved to https://phabricator.wikimedia.org/P50139 and previous config saved to /var/cache/conftool/dbconfig/20230804-234101-ladsgroup.json
[23:41:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[23:41:06] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:41:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[23:41:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T342617)', diff saved to https://phabricator.wikimedia.org/P50140 and previous config saved to /var/cache/conftool/dbconfig/20230804-234121-ladsgroup.json
[23:43:29] (PS1) Gergő Tisza: shell: Always wrap maintenance scripts in mwscript [mediawiki-config] - https://gerrit.wikimedia.org/r/945883 (https://phabricator.wikimedia.org/T343291)
[23:46:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T342617)', diff saved to https://phabricator.wikimedia.org/P50141 and previous config saved to /var/cache/conftool/dbconfig/20230804-234637-ladsgroup.json
[23:46:45] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617