[00:05:40] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:17:24] (CR) Dzahn: [C: +2] phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall) [00:18:56] (CR) Dzahn: [C: +2] "the script has been re-created on phab1001. and /etc/sudoers.d/scap_sudo_rules_phab-deploy_phabricator_deployment" [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall) [00:19:28] (PS1) Zabe: Add deployment-ms-be07 to swift storage hosts [puppet] - https://gerrit.wikimedia.org/r/828661 [00:20:34] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:07] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase201[3-8].codfw.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [00:21:12] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [00:24:49] (CR) Tim Starling: [C: +1] "Since this is a blocker for the multi-DC rollout, it would be nice if it could be deployed before the end of the week." 
[software/conftool] - https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: CDanis) [00:25:00] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:32:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:01] (PS1) Zabe: Fix deployment-prep swift cluster label [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) [00:49:39] (PS2) Zabe: Fix deployment-prep swift cluster label [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) [00:52:01] (PS3) Zabe: deploy swift_ring_manager to deployment-prep [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) [00:52:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:52:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:59:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:01:56] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:24] PROBLEM - etcd request latencies on kubemaster2002 
is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:13:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [01:20:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase201[3-8].codfw.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [01:20:28] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [01:26:20] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:29:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:48] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:05:40] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:34] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:14] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:42:22] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:48:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 
(10phaultfinder) [02:57:06] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:58:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [03:01:08] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:06:50] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:06:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:14:10] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:19:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:19:06] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:27:48] (03PS1) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 [03:28:34] (03CR) 10CI reject: [V: 04-1] vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (owner: 10AOkoth) [03:33:24] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:35:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:38:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:47:52] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:50:20] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:55:14] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:57:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:05:06] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:12:24] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:27:06] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:33:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [04:39:10] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:48:26] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:05:26] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:06:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:07:50] RECOVERY - mailman 
list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.333 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:12:22] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:16:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T316111 [05:16:22] T316111: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T316111 [05:16:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T316111 [05:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1181 with weight 0 T316111', diff saved to https://phabricator.wikimedia.org/P33726 and previous config saved to /var/cache/conftool/dbconfig/20220901-051701-ladsgroup.json [05:20:18] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:21:23] (03PS2) 10Tim Starling: Multi-DC stage 3: send 2% of traffic to appservers-ro [puppet] - 10https://gerrit.wikimedia.org/r/827616 (https://phabricator.wikimedia.org/T279664) [05:21:25] (03PS2) 10Tim Starling: Multi-DC stage 4: send all traffic to appservers-ro [puppet] - 10https://gerrit.wikimedia.org/r/827617 (https://phabricator.wikimedia.org/T279664) [05:21:27] (03PS1) 10Tim Starling: Multi-DC: go back to testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/828677 (https://phabricator.wikimedia.org/T279664) [05:26:48] PROBLEM 
- etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:29:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:33:48] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) So this is the fix: https://github.com/MariaDB/server/commit/92032499874259bae7455130958ea7f38c4d53a3 I am going to che... [05:37:08] (PS2) Ladsgroup: mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/826283 (https://phabricator.wikimedia.org/T316111) (owner: Gerrit maintenance bot) [05:37:13] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/826283 (https://phabricator.wikimedia.org/T316111) (owner: Gerrit maintenance bot) [05:39:18] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:50:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:51:20] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:59:12] (PS1) Giuseppe Lavagetto: Revert "Stop all 
PHP 7.4 user traffic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828591 [06:00:04] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T0600). [06:00:07] o/ [06:00:17] (03PS2) 10Giuseppe Lavagetto: Revert "Stop all PHP 7.4 user traffic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828591 [06:00:19] o/ [06:00:23] !log Starting s7 eqiad failover from db1136 to db1181 - T316111 [06:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:30] T316111: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T316111 [06:01:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T316111', diff saved to https://phabricator.wikimedia.org/P33727 and previous config saved to /var/cache/conftool/dbconfig/20220901-060100-ladsgroup.json [06:01:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1181 to s7 primary and set section read-write T316111', diff saved to https://phabricator.wikimedia.org/P33728 and previous config saved to /var/cache/conftool/dbconfig/20220901-060128-ladsgroup.json [06:02:13] edits are flowing back [06:04:18] <_joe_> jouncebot: next [06:04:18] In 0 hour(s) and 55 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T0700) [06:04:34] <_joe_> Amir1: if you're done, I'd do a quick deployment now [06:04:52] _joe_: I'm done for now, doing dns and stuff [06:04:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Stop all PHP 7.4 user traffic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828591 (owner: 10Giuseppe Lavagetto) [06:04:57] you're good mw side [06:05:03] <_joe_> ok :) [06:05:12] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:05:43] (Merged) jenkins-bot: Revert "Stop all PHP 7.4 user traffic" [mediawiki-config] - https://gerrit.wikimedia.org/r/828591 (owner: Giuseppe Lavagetto) [06:06:28] (PS2) Ladsgroup: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/826284 (https://phabricator.wikimedia.org/T316111) (owner: Gerrit maintenance bot) [06:06:52] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:07:28] (CR) Ladsgroup: [C: +2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/826284 (https://phabricator.wikimedia.org/T316111) (owner: Gerrit maintenance bot) [06:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1136 T316111', diff saved to https://phabricator.wikimedia.org/P33729 and previous config saved to /var/cache/conftool/dbconfig/20220901-060923-ladsgroup.json [06:09:29] T316111: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T316111 [06:10:50] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Moving 1% of users to php 7.4 (duration: 03m 55s) [06:12:43] marostegui: all done, now reboot/schema change/fun time [06:12:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:13:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:13:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:13:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:14:00] !log mwdebug-deploy@deploy1002 
helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:14:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:16:46] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:17:04] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:18:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:19:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:19:20] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:20:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:21:16] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:24:10] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:25:15] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Reverting to no php 7.4 traffic (duration: 03m 44s) [06:28:54] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:37:05] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:37:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:37:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:45:26] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:46:55] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:50:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2015.codfw.wmnet with OS bullseye [06:50:48] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host 
ganeti2015.codfw.wmnet with OS bullseye [06:54:33] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:00:05] Amir1, apergos, and jnuche: That opportune time is upon us again. Time for a UTC morning backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T0700). [07:00:05] _joe_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] morning! there are no trainees signed up today, and one patch scheduled for the window, but there is discussion of issues related to that patch in another channel, it's not ready to go out just yet I guess [07:00:20] <_joe_> yeah, sadly no, we won't deploy that patch [07:00:40] go ahead and remove it from the deployment calendar (or move it) when you get a chance [07:04:54] (03PS2) 10Jaime Nuche: Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583 (owner: 10Ahmon Dancy) [07:05:41] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:07:16] (03CR) 10Jaime Nuche: [C: 03+1] Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583 (owner: 10Ahmon Dancy) [07:10:23] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2015.codfw.wmnet with reason: host reimage [07:13:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2015.codfw.wmnet with reason: host reimage [07:17:58] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: 
instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:23:14] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:26:40] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [07:32:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2015.codfw.wmnet with OS bullseye [07:33:02] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS bullseye completed: - ganeti2015 (**PASS**) - Downtimed on... [07:51:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [07:56:24] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Moving 1% of traffic to php 7.4 (duration: 03m 42s) [07:57:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:00:05] dduvall and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T0800). [08:00:25] <_joe_> hashar: can we wait like 10 minutes? [08:00:41] even 10 hours [08:00:47] <_joe_> 5 might be enough actually [08:00:56] Dan is running the train tonight, I am merely the backup conductor this week [08:01:58] <_joe_> ah I see [08:02:01] then i can promote the wikis this morning [08:02:05] <_joe_> you were being literal :D [08:02:08] if that can assist with the php 7.4 thing [08:02:19] <_joe_> no need to do anything for me [08:02:29] for the wikis maybe? ;) [08:02:30] <_joe_> I just wanted to have 10 minutes without "new" errors [08:02:51] <_joe_> are you sure the wikis will be better off with the new version? has anyone asked them? [08:02:51] if I remember properly the issue started showing almost immediately [08:02:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [08:02:59] <_joe_> :D [08:03:11] <_joe_> hashar: yeah that's why I said 5 minutes in the end [08:03:19] +1 [08:03:38] <_joe_> but yeah, no occurrences [08:03:51] <_joe_> we can safely move on with the train [08:03:52] and in my experience most big issues are caught with group 1, it has enough traffic to catch the serialization issue if it happens [08:03:54] <_joe_> sorry for the wait [08:04:02] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:04:07] I will let dan run it tonight [08:04:21] I really really like TimStarling's approach to resolving this issue [08:04:42] I would never have thought of patching up php 7.4 to make it backward compatible [08:05:56] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:08:31] (PS1) Vgutierrez: trafficserver: Remove custom log for cp6008 and cp6016 [puppet] - https://gerrit.wikimedia.org/r/828758 (https://phabricator.wikimedia.org/T309651) [08:08:52] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:10:31] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37074/console" [puppet] - https://gerrit.wikimedia.org/r/828758 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez) [08:10:57] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Remove custom log for cp6008 and cp6016 [puppet] - https://gerrit.wikimedia.org/r/828758 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez) [08:15:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:12] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:16:37] SRE, SRE-Access-Requests: dbrant uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T316855 (Jelto) [08:17:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2015.codfw.wmnet to cluster codfw and group D [08:17:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: Readding downtime removed by reimage [08:17:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: Readding downtime removed by reimage [08:18:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2015.codfw.wmnet to cluster codfw and group D [08:20:36] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:22:53] (03CR) 10Volans: [C: 03+1] "LGTM, just a question from my lack of context:" [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [08:25:43] (03PS1) 10Jelto: admin: revoke dbrant ssh key [puppet] - 10https://gerrit.wikimedia.org/r/828760 (https://phabricator.wikimedia.org/T316855) [08:28:55] (03CR) 10Jelto: [C: 03+2] admin: revoke dbrant ssh key [puppet] - 10https://gerrit.wikimedia.org/r/828760 (https://phabricator.wikimedia.org/T316855) (owner: 10Jelto) [08:30:06] (03PS1) 10Muehlenhoff: Extend access for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/828762 [08:30:20] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:31:13] (03PS1) 10Muehlenhoff: Remove access for ricby [puppet] - 10https://gerrit.wikimedia.org/r/828763 [08:31:28] (03PS2) 10Muehlenhoff: Extend access for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/828762 [08:32:47] 10SRE, 10SRE-Access-Requests: dbrant uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T316855 (10Jelto) 05Open→03Stalled p:05Triage→03Medium a:03Dbrant @dbrant I removed your SSH key used on the production cluster. This was due to it also being used in WMCS, and thu... 
[08:32:53] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [08:33:26] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/828762 (owner: 10Muehlenhoff) [08:36:19] (03PS2) 10Muehlenhoff: Remove access for ricby [puppet] - 10https://gerrit.wikimedia.org/r/828763 [08:37:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [08:42:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for ricby [puppet] - 10https://gerrit.wikimedia.org/r/828763 (owner: 10Muehlenhoff) [08:44:54] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:48:48] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I had a closer look and that will even be compatible with our remaining Stretch hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott) [08:53:30] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:34] (03CR) 10Muehlenhoff: [C: 03+2] routinator: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826869 (owner: 10Muehlenhoff) [09:07:10] (03PS1) 10Clément Goubert: Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828786 (https://phabricator.wikimedia.org/T312638) [09:08:57] (03CR) 10Muehlenhoff: [C: 03+2] Stop reporting releng images to debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [09:10:22] (03CR) 10Aqu: [C: 03+1] "Hey Sandra," [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [09:15:48] (03CR) 10Aqu: [C: 03+1] Update Puppet files for Airflow Upgrade to 2.3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [09:22:23] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:07] (03CR) 10Muehlenhoff: [C: 03+2] mariadb::config: Remove old tmpfile hack [puppet] - 10https://gerrit.wikimedia.org/r/826858 (owner: 10Muehlenhoff) [09:24:25] (03CR) 10Filippo Giunchedi: "Nice, thank you Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [09:24:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM I also double checked all IPs." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/828786 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [09:24:44] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828788 [09:26:11] (03PS1) 10Marostegui: mariadb: Promote pc1014 to pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/828789 [09:26:40] <_joe_> jouncebot: next [09:26:40] In 0 hour(s) and 33 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1000) [09:27:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote pc1014 to pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/828789 (owner: 10Marostegui) [09:27:46] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828788 (owner: 10Marostegui) [09:28:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove zookeeper_version [puppet] - 10https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: 10Muehlenhoff) [09:28:18] (03PS1) 10Zabe: Switch to deployment-urldownloader03 [puppet] - 10https://gerrit.wikimedia.org/r/828790 (https://phabricator.wikimedia.org/T278641) [09:28:54] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828788 (owner: 10Marostegui) [09:29:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:30:29] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:31:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [09:32:40] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc3 master (duration: 03m 34s) [09:32:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:32:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:33:09] (03Abandoned) 10Muehlenhoff: profile::docker::engine: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820745 (owner: 10Muehlenhoff) [09:33:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff) [09:33:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:35:05] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:35:19] (03Abandoned) 10Muehlenhoff: Stop specifying specific docker releases [puppet] - 10https://gerrit.wikimedia.org/r/670840 (owner: 10Muehlenhoff) [09:35:46] (03PS2) 10Muehlenhoff: installserver: Remove support for pre buster [puppet] - 10https://gerrit.wikimedia.org/r/811681 [09:36:35] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:38:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:39:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:39:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:40:23] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:40:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:45:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:45:33] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:45:49] (03CR) 10Muehlenhoff: [C: 03+2] proxysql: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/820737 (owner: 10Muehlenhoff) [09:46:23] (03CR) 10Clément Goubert: [C: 03+2] Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828786 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [09:46:35] RECOVERY - cassandra-b service on restbase1033 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:46:45] <_joe_> marostegui: are you done with your change? 
[09:46:55] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:47:15] RECOVERY - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-b valid until 2024-08-28 11:43:21 +0000 (expires in 727 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:47:20] _joe_: yes! [09:47:35] (03Merged) 10jenkins-bot: Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828786 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [09:47:41] <_joe_> claime: ^^ [09:47:42] * claime cracks knuckles [09:48:20] (03CR) 10Muehlenhoff: [C: 03+2] druid: Fixed UID/GIDs are universally in use now [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff) [09:48:22] jouncebot: now [09:48:22] For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T0800) [09:48:23] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:48:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:49:35] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: start carbon.service at boot [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [09:49:50] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: properly shut carbon-c-relay [puppet] - 
10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [09:50:17] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove upstart configs in /etc/init/ [puppet] - 10https://gerrit.wikimedia.org/r/828477 (owner: 10Filippo Giunchedi) [09:50:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:51:32] (03PS2) 10Filippo Giunchedi: Remove upstart configs in /etc/init/ [puppet] - 10https://gerrit.wikimedia.org/r/828477 [09:51:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:51:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:51:55] (03CR) 10Filippo Giunchedi: [V: 03+2] Remove upstart configs in /etc/init/ [puppet] - 10https://gerrit.wikimedia.org/r/828477 (owner: 10Filippo Giunchedi) [09:52:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:52:43] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Collect/export helm list call latencies [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/820713 (owner: 10JMeybohm) [09:54:57] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:56:13] (03PS1) 10JMeybohm: helm-state-metrics: Update to v0.1.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/828792 [09:56:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:57:21] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Update to v0.1.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/828792 (owner: 10JMeybohm) [09:57:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:58:30] !log cgoubert@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:828786|Update wgLinterSubmitterWhitelist (T312638)]] (duration: 03m 37s) [09:58:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:58:35] T312638: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 [09:58:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:59:40] (03PS1) 10JMeybohm: helm-state-metrics: Update to v0.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/828793 [10:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1000). [10:00:09] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:02:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:07:03] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/828853
[10:08:16] (PS1) Marostegui: Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - https://gerrit.wikimedia.org/r/828854
[10:08:26] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/828853 (owner: Marostegui)
[10:08:42] (CR) Muehlenhoff: [C: +2] varnish::common: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826855 (owner: Muehlenhoff)
[10:09:39] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/828853 (owner: Marostegui)
[10:10:15] (CR) Marostegui: [C: +2] Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - https://gerrit.wikimedia.org/r/828854 (owner: Marostegui)
[10:12:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:13:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:13:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:13:35] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1013 back to pc3 master (duration: 03m 43s)
[10:14:19] SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (Joe) Regarding the appserver alerts, I think we should go in the following direction: * Have one metric that tells...
[10:14:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:17:49] (03PS1) 10Slyngshede: Icinga: Add dry-run option. [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) [10:18:42] (03CR) 10JMeybohm: [C: 03+2] helm-state-metrics: Update to v0.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/828793 (owner: 10JMeybohm) [10:19:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:19:29] (03CR) 10Giuseppe Lavagetto: Move 5% of traffic to php 7.4 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [10:19:34] (03PS5) 10Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) [10:20:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:20:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:21:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:22:59] (03Merged) 10jenkins-bot: helm-state-metrics: Update to v0.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/828793 (owner: 10JMeybohm) [10:24:22] (03CR) 10CI reject: [V: 04-1] Icinga: Add dry-run option. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [10:26:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [10:29:00] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [10:29:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [10:31:37] (03PS2) 10Slyngshede: Icinga: Add dry-run option. [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) [10:31:46] (03PS7) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [10:32:45] (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:34] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) I'm aiming to do stage 3 and 4 on September 6. [10:36:01] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1002.eqiad.wmnet [10:37:04] (03CR) 10Volans: [C: 03+1] "LGTM, is the puppet compiler happy?" 
[puppet] - 10https://gerrit.wikimedia.org/r/811681 (owner: 10Muehlenhoff) [10:37:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [10:38:28] (03CR) 10CI reject: [V: 04-1] Icinga: Add dry-run option. [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [10:40:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1001.eqiad.wmnet [10:40:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1001.eqiad.wmnet [10:41:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [10:42:05] (03PS3) 10Slyngshede: Icinga: Add dry-run option. [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) [10:43:05] !log pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638 [10:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:35] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:43:38] !log depooled wtp1034.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638 [10:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:56] (03CR) 10CDanis: [C: 03+2] dbctl: Add omit_replicas_in_mwconfig section attribute (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [10:47:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:49:42] (03Merged) 10jenkins-bot: dbctl: Add omit_replicas_in_mwconfig section attribute [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [10:52:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:53:09] (03CR) 10Muehlenhoff: installserver: Remove support for pre buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811681 (owner: 10Muehlenhoff) [10:53:27] RECOVERY - mediawiki-installation DSH group on parse1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:54:09] (03CR) 10Volans: [C: 03+1] "LGTM, just a couple of wording nits inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [10:54:11] (03PS1) 10CDanis: bump version [software/conftool] - 10https://gerrit.wikimedia.org/r/828803 [10:54:13] (03PS1) 10CDanis: debian/changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/828804 [10:55:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2002.codfw.wmnet [10:56:14] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet [10:56:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet [10:58:38] !log pooled parse1002.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638 [10:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:24] (03PS4) 10Slyngshede: Icinga: Add dry-run option. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) [10:59:48] (03CR) 10CDanis: [C: 03+2] bump version [software/conftool] - 10https://gerrit.wikimedia.org/r/828803 (owner: 10CDanis) [11:00:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:01:13] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [11:01:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2002.codfw.wmnet [11:02:18] (03CR) 10Slyngshede: Icinga: Add dry-run option. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [11:02:20] (03Merged) 10jenkins-bot: bump version [software/conftool] - 10https://gerrit.wikimedia.org/r/828803 (owner: 10CDanis) [11:02:39] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/828804 (owner: 10CDanis) [11:02:45] (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:03:13] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:04:20] !log depooled wtp1035.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638 [11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:16] (03CR) 10Muehlenhoff: [C: 
03+2] installserver: Remove support for pre buster [puppet] - 10https://gerrit.wikimedia.org/r/811681 (owner: 10Muehlenhoff) [11:05:18] (03CR) 10CDanis: [C: 03+2] debian/changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/828804 (owner: 10CDanis) [11:05:43] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:05:53] PROBLEM - k8s requests count to the API on ml-serve-ctrl1002 is CRITICAL: 100.4 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [11:06:15] (03PS2) 10Muehlenhoff: imagecatalog: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/774939 [11:07:34] (03Merged) 10jenkins-bot: debian/changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/828804 (owner: 10CDanis) [11:07:44] (03CR) 10Slyngshede: [C: 03+2] Icinga: Add dry-run option. [software/spicerack] - 10https://gerrit.wikimedia.org/r/828800 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [11:08:56] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) 05Open→03In progress Missing the two last bulletpoints: - check if there are other uses of @retry across... 
[11:14:02] * _joe_ lunch
[11:14:09] <_joe_> err wrong channel :D
[11:15:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:17:57] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:19:47] SRE, SRE-Access-Requests: dbrant uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T316855 (Dbrant) Stalled→Open a:Dbrant→Jelto d'oh, sorry about that -- not sure how I mixed that up. Here is my new key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIELYi21hPrBhAm...
[11:25:57] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33732 and previous config saved to /var/cache/conftool/dbconfig/20220901-112733-ladsgroup.json [11:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:27:39] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:29:43] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:31:02] (03CR) 10Muehlenhoff: [C: 03+2] imagecatalog: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/774939 (owner: 10Muehlenhoff) [11:33:21] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:34:33] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:36:18] (03PS4) 10Muehlenhoff: klaxon: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 [11:40:34] (03PS1) 10Marostegui: mariadb: Promote db1183 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/828903 (https://phabricator.wikimedia.org/T316744) [11:42:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P33733 and previous config saved to /var/cache/conftool/dbconfig/20220901-114239-ladsgroup.json [11:48:40] !log root@apt1001:/home/cdanis/build-area# reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia conftool_2.2.2-1_amd64.changes [11:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:21] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:53:11] (03CR) 10Muehlenhoff: [C: 03+2] klaxon: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [11:57:03] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:57:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:57:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P33734 and previous config saved to /var/cache/conftool/dbconfig/20220901-115746-ladsgroup.json [11:59:42] (03PS1) 10CDanis: dbctl: update schema for 2.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/828946 (https://phabricator.wikimedia.org/T316482) [11:59:48] !log rebalance row B after completed Bullseye updates T311686 [11:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:53] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [12:01:45] (03CR) 10CDanis: [V: 03+2 C: 03+2] dbctl: update schema for 2.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/828946 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [12:03:30] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [12:03:49] (03CR) 10Volans: "reply inline plus comments on the implementation of one of the cookbooks. It applies to the other too ofc." [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [12:04:47] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:04:59] (03PS1) 10CDanis: dbctl: update schema for 2.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/828947 (https://phabricator.wikimedia.org/T316482) [12:05:13] (03CR) 10Volans: [C: 03+2] "As it's trivial, self-merging it. Happy to adapt if any comment will come later on." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/828609 (owner: 10Volans) [12:05:30] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:05:44] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:07:08] (03CR) 10CDanis: [V: 03+2 C: 03+2] dbctl: update schema for 2.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/828947 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [12:07:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:06] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: sort drivers files [cookbooks] - 10https://gerrit.wikimedia.org/r/828609 (owner: 10Volans) [12:12:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33735 and previous config saved to /var/cache/conftool/dbconfig/20220901-121252-ladsgroup.json [12:12:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:12:58] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:13:01] PROBLEM - Host ml-serve-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance 
[12:13:17] RECOVERY - Host ml-serve-ctrl1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [12:13:28] !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-serve-ctrl1001.eqiad.wmnet [12:13:28] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve-ctrl1001.eqiad.wmnet [12:14:49] RECOVERY - k8s requests count to the API on ml-serve-ctrl1002 is OK: (C)100 ge (W)50 ge 0.8458 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [12:15:55] RECOVERY - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.152 port 9042 https://phabricator.wikimedia.org/T93886 [12:16:23] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:16:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:17:46] (03PS1) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 [12:19:25] (03CR) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [12:20:27] !log cdanis@cumin2002 dbctl commit (dc=all): 'T316482 remove replicas from x2', diff saved to https://phabricator.wikimedia.org/P33736 and previous config saved to /var/cache/conftool/dbconfig/20220901-122026-cdanis.json [12:20:34] T316482: Update wgLBFactoryConf for x2 to register only the local primary - https://phabricator.wikimedia.org/T316482 [12:23:18] (ProbeDown) firing: Service 
thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:11] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:24:11] PROBLEM - k8s requests count to the API on ml-serve-ctrl1001 is CRITICAL: 106.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [12:24:18] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:42] (03CR) 10CI reject: [V: 04-1] Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [12:27:32] (03CR) 10DCausse: [C: 03+1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:28:18] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:29] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:29:18] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:28] !log restarted thanos-query on thanos-fe1001 [12:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:32] (03PS2) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 [12:32:51] <_joe_> jouncebot: next [12:32:51] In 0 hour(s) and 27 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1300) [12:32:51] In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1300) [12:36:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:37:53] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:40:49] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:41:23] 
RECOVERY - cassandra-c service on restbase1033 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:43:39] RECOVERY - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-c valid until 2024-08-28 11:43:23 +0000 (expires in 726 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:45:41] PROBLEM - k8s requests count to the API on ml-serve-ctrl1001 is CRITICAL: 107.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [12:48:43] 10SRE-tools, 10Infrastructure-Foundations, 10Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (10Volans) @hashar thanks for the comprehensive summary of our IRC chat. I don't mind either direction, create a separate library for y... [12:48:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [12:50:03] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:50:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:54:36] (03PS2) 10Muehlenhoff: Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/827985 [12:55:27] (03PS1) 10JMeybohm: helm-state-metrics: Account for sudden memory spikes [deployment-charts] - 10https://gerrit.wikimedia.org/r/828955 [12:55:45] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Account for sudden memory spikes [deployment-charts] - 10https://gerrit.wikimedia.org/r/828955 (owner: 10JMeybohm) [12:56:43] !log 
jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:56:58] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:57:30] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/827985 (owner: 10Muehlenhoff) [12:58:12] (03PS1) 10JMeybohm: helm-state-metrics: Forgot to bump chart version for new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/828957 [12:58:23] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Forgot to bump chart version for new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/828957 (owner: 10JMeybohm) [12:58:25] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:59:57] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:00:04] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1300). [13:00:04] _joe_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:31] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:00:49] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:02:29] <_joe_> hi! [13:03:19] <_joe_> is it just me? 
[13:03:45] <_joe_> I guess I can just deploy my change then by myself [13:03:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [13:04:35] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:05:02] (03Merged) 10jenkins-bot: Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [13:05:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:06:50] (03CR) 10Volans: [C: 04-1] "Missing restart_daemons(), beside that LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [13:08:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:51] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823677|Move 5% of traffic to php 7.4 (T271736)]] (duration: 03m 45s) [13:09:55] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [13:10:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:12:06] 10SRE, 10SRE Observability: Consider bringing thanos-query logs into logstash - https://phabricator.wikimedia.org/T316867 (10herron) p:05Triage→03Medium [13:13:49] (03PS1) 10Herron: logstash: output thanos-query syslogs to kafka and local file [puppet] - 10https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) [13:15:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:15:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:15:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:18:41] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:18:41] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:18:59] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:19:06] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:19:09] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:19:20] (03PS1) 10Jelto: admin: add new ssh key for dbrant [puppet] - 10https://gerrit.wikimedia.org/r/828961 (https://phabricator.wikimedia.org/T316855) [13:19:22] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:19:27] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:19:41] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:20:01] 10SRE, 10Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez this seems to be working and not breaking anything :). As a direct result cache hitrate shows up to a 100% increase in the text cluster... 
[13:20:24] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 01): eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF) [13:21:04] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10JArguello-WMF) [13:22:11] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:23:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:25:56] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 01): eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10lbowmaker) a:03gmodena [13:26:15] (03CR) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [13:26:17] (03PS3) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 [13:26:30] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10lbowmaker) a:05Jelto→03gmodena [13:27:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:27:29] (03CR) 10Jelto: [C: 03+2] admin: add new ssh key for dbrant [puppet] - 10https://gerrit.wikimedia.org/r/828961 (https://phabricator.wikimedia.org/T316855) (owner: 10Jelto) [13:29:26] (03PS1) 10Clément Goubert: cloudweb: Explicit docker-ce package installation [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) [13:30:49] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:31:07] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) >>! In T296832#8143427, @cmooney wrote: > For a bit of context the above patch will augment the existing vars... [13:31:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1107,1117,1183].eqiad.wmnet with reason: switchover m5 T316744 [13:31:49] T316744: Switchover m5 master (db1107 -> db1183) - https://phabricator.wikimedia.org/T316744 [13:31:49] (03CR) 10Muehlenhoff: "Why's that, they currently use docker.io? 
https://debmonitor.wikimedia.org/hosts/cloudweb1003.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:32:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1107,1117,1183].eqiad.wmnet with reason: switchover m5 T316744 [13:33:17] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:33:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1183 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/828903 (https://phabricator.wikimedia.org/T316744) (owner: 10Marostegui) [13:34:50] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10demon) [13:35:47] (03CR) 10Clément Goubert: cloudweb: Explicit docker-ce package installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:36:17] (03Abandoned) 10Clément Goubert: cloudweb: Explicit docker-ce package installation [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:37:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox2002.codfw.wmnet [13:38:35] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:39:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice work! That's a big change! First round of comments inline, adding also Janis to review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [13:39:21] (03CR) 10Muehlenhoff: "Ack :-) BTW, you can also search for a package name like this: https://debmonitor.wikimedia.org/packages/docker-ce (as such this is only n" [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:41:29] 10SRE, 10SRE-Access-Requests: dbrant uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T316855 (10Jelto) 05Open→03Resolved Key confirmed also via gmail and merged. You should have access in around 30 minutes. I'm closing this task. Feel free to re-open if you have problem... [13:41:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2002.codfw.wmnet [13:41:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:42:53] (03CR) 10Clément Goubert: cloudweb: Explicit docker-ce package installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:43:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1002.eqiad.wmnet [13:43:35] !log rebooting netbox1002 (running netbox.wikimedia.org) [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:04] (03CR) 10Muehlenhoff: cloudweb: Explicit docker-ce package installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828963 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:48:49] (03PS1) 10Clément Goubert: P:Docker::Engine: Default to installing docker.io [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) 
[13:50:26] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37081/console" [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:52:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox1002.eqiad.wmnet [13:53:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:53:41] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:53:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [13:53:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [13:54:35] (03CR) 10Andrew Bogott: [C: 03+2] P:systemd::timedated: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott) [13:56:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [13:56:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [13:57:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [14:00:09] !log Failover m5 from db1107 to db1183 - T316744 [14:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:14] T316744: Switchover m5 master (db1107 -> db1183) - 
https://phabricator.wikimedia.org/T316744 [14:00:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:01:20] !log test T316744 [14:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:37] !log test T316744 [14:01:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [14:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] (03PS2) 10Clément Goubert: P:Docker::Engine: Default to installing docker.io [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) [14:01:46] (03PS1) 10Clément Goubert: docker: Clean hiera following P:DockerEngine cleanup [puppet] - 10https://gerrit.wikimedia.org/r/829011 [14:03:14] (03PS1) 10Tchanders: Enable partial action blocks on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) [14:03:34] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37082/console" [puppet] - 10https://gerrit.wikimedia.org/r/829011 (owner: 10Clément Goubert) [14:04:32] (03CR) 10Clément Goubert: P:Docker::Engine: Default to installing docker.io (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:05:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:06:22] (03PS2) 10Clément Goubert: docker: Clean hiera following P:DockerEngine cleanup [puppet] - 10https://gerrit.wikimedia.org/r/829011 
(https://phabricator.wikimedia.org/T316639) [14:07:19] !log installing net-snmp security updates on Buster [14:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:09:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit." [puppet] - 10https://gerrit.wikimedia.org/r/829011 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:10:18] (03CR) 10Clément Goubert: docker: Clean hiera following P:DockerEngine cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829011 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:13:50] (03CR) 10Marostegui: [C: 03+2] dbproxy1017,dbproxy1021: Add db1117:3325 as standby [puppet] - 10https://gerrit.wikimedia.org/r/829013 (https://phabricator.wikimedia.org/T316870) (owner: 10Marostegui) [14:14:43] (03CR) 10Clément Goubert: [C: 03+2] P:Docker::Engine: Default to installing docker.io [puppet] - 10https://gerrit.wikimedia.org/r/828965 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:15:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:15:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829011 (https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:18:59] (03CR) 10AGueyte: [C: 03+1] Enable partial action blocks on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: 10Tchanders) [14:20:27] (03CR) 10Clément Goubert: [C: 03+2] docker: Clean hiera following P:DockerEngine cleanup [puppet] - 10https://gerrit.wikimedia.org/r/829011 
(https://phabricator.wikimedia.org/T316639) (owner: 10Clément Goubert) [14:21:08] (03PS1) 10AikoChou: ml-services: update outlink image to support pre-transformed data [deployment-charts] - 10https://gerrit.wikimedia.org/r/829015 (https://phabricator.wikimedia.org/T315998) [14:21:56] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the new cookbook!" [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [14:24:23] (03CR) 10Volans: [C: 03+1] "forgot one nit" [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [14:25:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [14:29:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [14:31:29] (03PS1) 10Giuseppe Lavagetto: gitlab: add gitlab::release::binary [puppet] - 10https://gerrit.wikimedia.org/r/829016 [14:32:17] (03CR) 10CI reject: [V: 04-1] gitlab: add gitlab::release::binary [puppet] - 10https://gerrit.wikimedia.org/r/829016 (owner: 10Giuseppe Lavagetto) [14:33:08] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:33:50] (03PS4) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 [14:34:04] (03CR) 10Muehlenhoff: Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [14:35:28] RECOVERY - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.153 port 9042 https://phabricator.wikimedia.org/T93886 [14:37:30] (03PS2) 10Michael Große: Add config for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) [14:38:36] 10SRE-swift-storage: 
Swift users and their usage - https://phabricator.wikimedia.org/T264291 (10LSobanski) Another potential usage is GitLab artifacts, adding @thcipriani for awareness. [14:39:07] (03PS3) 10Michael Große: Add config for redirect badges on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) [14:39:34] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829020 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [14:40:05] (03CR) 10Volans: "Thanks for writing a new cookbook!" [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:40:49] (03PS5) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [14:40:51] (03CR) 10Volans: [C: 03+1] "ship it" [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [14:41:22] (03CR) 10Hnowlan: thumbor: new service chart (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:42:21] 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Jelto) I don't want to over complicate the decommission of the service. But I was thinking about depooling the service first from confctl. Dep... 
[14:43:19] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to perform a rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/828950 (owner: 10Muehlenhoff) [14:43:54] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [14:46:24] (03PS1) 10Ssingh: trafficserver: remove deprecated config for ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/829022 (https://phabricator.wikimedia.org/T309651) [14:48:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37083/console" [puppet] - 10https://gerrit.wikimedia.org/r/829022 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:51:57] (03PS1) 10Muehlenhoff: sre.misc-clusters.thumbor: Switch to SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/829023 [14:53:36] 10SRE, 10ops-codfw, 10DBA: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494 (10Papaul) 05Open→03Resolved Disk replaced ` Solid State Disk 0:1:5 Online 5 1787.88 GB Not Capable SATA SSD No 100% Operational State Rebuilding Progress 2% [14:54:31] !log Draining traffic from Lumen Transport CCT 442550294 (cr1-codfw to cr4-ulsfo) ahead of hot-cut to lower-latency path with carrier [14:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:56] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:55:11] (03CR) 10CI reject: [V: 04-1] sre.misc-clusters.thumbor: Switch to SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/829023 (owner: 10Muehlenhoff) [14:56:56] (03PS1) 10Muehlenhoff: Remove sre.misc-clusters.sretest [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 [14:57:36] (03PS2) 10Muehlenhoff: 
sre.misc-clusters.thumbor: Switch to SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/829023 [14:58:48] (03CR) 10Vlad.shapik: Fix environment in prep stage (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:59:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:48] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:01:51] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) I have finished compiling and packaging the branch with the fix, I have installed it on db1125 (test host) and will do... [15:02:32] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Hi all -- I think it's time to pursue Option 2. Modify our mail servers to accept authenticated SMTP connections and use those for relaying f... 
[15:03:34] (03CR) 10Vgutierrez: [C: 03+1] "nice catch" [puppet] - 10https://gerrit.wikimedia.org/r/829022 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:06:11] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: remove deprecated config for ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/829022 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:06:20] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:15:34] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10Volans) I had a chat with @JMeybohm on IRC and we went over some of the options/possibilities that we have he... [15:15:40] (03CR) 10Muehlenhoff: global: drop owner/group => root from file resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [15:18:30] (03CR) 10Volans: "I don't have a strong opinion, we could keep it as the minimal working example and have the example.txt as a full example of all capabilit" [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff) [15:19:39] !log updating docker.io on ml-serve* to bugfix release from Bullseye 11.4 point release [15:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:56] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 [15:21:02] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:21:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: 
host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:21:41] !log installing usb.ids update from Bullseye 11.4 point release [15:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [15:23:29] (03CR) 10Vlad.shapik: [C: 04-1] Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:34:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:34:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:35:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:35:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:38:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 (owner: 10Giuseppe Lavagetto) [15:45:05] (03PS1) 10David Caro: bullseye0: Add bullseye buildpack build/run images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) [15:49:20] (03PS1) 10Ssingh: trafficserver: send SIGUSR2 on service reload [puppet] - 10https://gerrit.wikimedia.org/r/829034 [15:50:30] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37085/console" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [15:52:34] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10EBernhardson) additional affected projects: * search/MjoLniR/deploy * search/airflow [15:53:46] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. One consideration is that this will likely prevent us using "dynamic neighbor" for these peerings at a later stage? I'm basing tha" [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [15:54:37] (03PS4) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 [15:55:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase103[1-3].eqiad.wmnet [16:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:06:03] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37086/console" [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto) [16:06:24] (03PS1) 10Cathal Mooney: Change CR template to only include Kubedse in BGP_Switch_in for eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/829037 (https://phabricator.wikimedia.org/T310174) [16:07:44] (03CR) 10Cathal Mooney: [C: 03+2] Change CR template to only include Kubedse in BGP_Switch_in for eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/829037 (https://phabricator.wikimedia.org/T310174) (owner: 10Cathal Mooney) [16:07:57] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:07:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:34] (03Merged) 10jenkins-bot: Change CR template to only include Kubedse in BGP_Switch_in for eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/829037 (https://phabricator.wikimedia.org/T310174) (owner: 10Cathal Mooney) [16:13:51] (03PS5) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 [16:16:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "I will merge this change and apply it carefully across the infrastructure. 
I hope not to generate" [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto) [16:16:59] (03PS1) 10ArielGlenn: Add Hannah Okwelum to icinga read access, remove Holger Knust [puppet] - 10https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) [16:17:26] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=restbase103[1-3].eqiad.wmnet [16:20:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:24:32] ^^ above was doh1001, not sure what happened, CR brought session down due to BFD timeouts to the VM. [16:24:43] Re-established about 20 seconds later seems to be ok [16:25:24] We've seen this occasionally and unsure as to root cause (VM scheduling?), given the anycast nature of the DoH setup shouldn't affect service [16:25:36] topranks: one of those again :) [16:25:42] yeah [16:27:05] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) (owner: 10Herron) [16:27:34] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [16:29:22] !log Bringing Lumen Transport CCT 442550294 (cr1-codfw to cr4-ulsfo) back into service following successful hot-cut to lower-latency path with carrier [16:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:08] (03PS2) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 [16:33:47] (03CR) 10CI reject: [V: 04-1] vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (owner: 10AOkoth) [16:35:41] (03PS3) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 [16:35:48] (03PS2) 10Ssingh: trafficserver: send SIGUSR2 on log rotation [puppet] - 10https://gerrit.wikimedia.org/r/829034 [16:36:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37087/console" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [16:44:20] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/829023 (owner: 10Muehlenhoff) [16:53:18] (03CR) 10Volans: sre.k8s.pool-depool-cluster: Add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [17:00:04] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1700). 
[17:01:36] (03PS1) 10Clément Goubert: O:mediawiki::common: Exclude VM from cpufrequtils [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [17:01:53] (03CR) 10Vgutierrez: [C: 04-1] "nice catch, we just need to fix a few things before getting it merged" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:02:52] (03CR) 10Ssingh: [V: 03+1] trafficserver: send SIGUSR2 on log rotation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:04:36] (03CR) 10Ryan Kemper: [C: 03+2] admin: Update my home directory [puppet] - 10https://gerrit.wikimedia.org/r/828630 (owner: 10Ebernhardson) [17:04:51] (03PS2) 10Clément Goubert: O:mediawiki::common: Exclude VM from cpufrequtils [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [17:06:12] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37089/console" [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: 10Clément Goubert) [17:06:34] (03PS3) 10Clément Goubert: O:mediawiki::common: Exclude VM from cpufrequtils [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [17:08:10] (03PS4) 10Clément Goubert: P:mediawiki::common: Exclude VM from cpufrequtils [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [17:09:50] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37090/console" [puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: 10Clément Goubert) [17:14:14] (03PS1) 10BryanDavis: developer-portal: Bump container to 2022-09-01-112116-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/829042 [17:15:08] 10SRE, 
10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Hi all, sorry for the second update today -- Qualtrics support suggested that we should set up the Custom From Domain (like the start of thi... [17:19:28] (03CR) 10Volans: [C: 04-2] "I don't think this code can be merged in its current form. Instead of doing a full code review I'll just highlight first the main issues f" [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [17:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:56] 🤔 [17:23:43] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2022-09-01-112116-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/829042 (owner: 10BryanDavis) [17:23:50] (03PS3) 10Ssingh: trafficserver: send SIGUSR2 on log rotation [puppet] - 10https://gerrit.wikimedia.org/r/829034 [17:24:42] (03CR) 10Ssingh: "Updated CR to address the comments. 
I went with the systemctl show --property MainPID method because it just feels cleaner but I have no s" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37091/console" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:26:46] !log restarted rsyslog on centrallog2002 [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:24] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-09-01-112116-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/829042 (owner: 10BryanDavis) [17:27:48] (03PS4) 10Ssingh: trafficserver: send SIGUSR2 on log rotation [puppet] - 10https://gerrit.wikimedia.org/r/829034 [17:28:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37092/console" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:31:39] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:32:33] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:32:44] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:33:22] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:33:29] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:34:13] !log bd808@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/developer-portal: apply [17:36:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:36:15] (03PS4) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079 [17:43:01] (03PS5) 10Ssingh: trafficserver: send SIGUSR2 on log rotation [puppet] - 10https://gerrit.wikimedia.org/r/829034 [17:43:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079 (owner: 10Giuseppe Lavagetto) [17:44:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37093/console" [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [17:52:06] <_joe_> jouncebot: next [17:52:06] In 0 hour(s) and 7 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1800) [17:52:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Jclark-ctr) db1196 E2 U36 Port41 CableId 23000071 db1197 E2 U37 Port42 CableId 23000035 db1198 E3 U36... 
[17:52:13] <_joe_> uhhh just now heh [17:52:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Jclark-ctr) [17:53:06] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Jclark-ctr) a:05Jclark-ctr→03Papaul [18:00:05] dduvall and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T1800). [18:07:40] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.novafullstack: Remove nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/814798 (owner: 10David Caro) [18:07:53] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:01] (03PS1) 10Clare Ming: Disable sticky header edit experiment for idwiki, viwki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) [18:08:09] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.novafullstack: stop sending stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/814800 (owner: 10David Caro) [18:09:09] (03CR) 10Muehlenhoff: [C: 04-1] "This check should rather be moved into the cpufrequtils class instead of checking it in the call sites." 
[puppet] - 10https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: 10Clément Goubert) [18:11:09] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:14:00] (03PS3) 10Jdlrobson: Remove Vector grid config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [18:14:07] (03CR) 10Jdlrobson: "Clare will backport this today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [18:15:43] (03CR) 10Jdlrobson: [C: 03+1] "Patch looks fine tio me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [18:21:18] (03PS2) 10Clare Ming: Disable sticky header edit experiment for idwiki, viwki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) [18:22:39] (03CR) 10Clare Ming: Disable sticky header edit experiment for idwiki, viwki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [18:28:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:19] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:13] <_joe_> uhm this must be me [18:29:15] 
<_joe_> sorry [18:29:31] ah ok [18:30:19] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80, 10.2.2.40:7443]) https://wikitech.wikimedia.org/wiki/PyBal [18:32:10] looks like labweb as well -- labweb.svc.eqiad.wmnet has address 10.2.2.40 [18:32:26] <_joe_> ueaj [18:32:28] <_joe_> *yeah [18:32:40] <_joe_> but wikitech is back [18:32:45] <_joe_> so I assume that is fixed too? [18:33:01] <_joe_> basically there's a misconfiguration on cloudweb that caused some automation to fail [18:33:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:37] <_joe_> herron: force a recheck, if it fails I'll take a further look at lv1019 [18:33:44] ok doing [18:33:59] <_joe_> thanks and sorry for the page [18:34:47] <_joe_> we might need to add better safeguards to the restart scripts [18:35:19] no worries, recheck forced, let's see if that clears now [18:35:21] <_joe_> basically the script was told to depool the servers from an inexistent lvs pool by configuration, so the script errored out [18:35:25] (03CR) 10Clare Ming: [C: 03+1] "yup - scheduled for next window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [18:35:31] <_joe_> so both servers got depooled [18:35:38] <_joe_> but not pooled back :P [18:35:48] (03PS1) 10TrainBranchBot: all wikis to 
1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829049 (https://phabricator.wikimedia.org/T314188) [18:35:49] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829049 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:36:15] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:36:20] (03CR) 10Clare Ming: [C: 03+1] Disable sticky header edit experiment for idwiki, viwki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [18:36:26] there we go [18:36:37] (03CR) 10Clare Ming: [C: 03+1] "yup - scheduled for next window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [18:36:57] ahh that's a fun trick haha [18:37:25] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829049 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:37:40] <_joe_> yeah well probably the safeguard is to check all pools for existence *before* we try to depool stuff [18:38:06] <_joe_> anyways, I should be done for the night, no more pages coming from me I hope! [18:39:44] ha, have a good night joe [18:42:19] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.27 refs T314188 [18:42:24] T314188: 1.39.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T314188 [18:42:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:45:39] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [18:46:37] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:48:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:48:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:48:21] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [18:49:24] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1196 [18:49:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1196 [18:49:53] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1197 [18:50:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1197 [18:50:06] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1198 [18:50:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1198 [18:50:19] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1199 [18:50:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1199 [18:50:42] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1200 [18:50:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1200 [18:50:59] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1201 [18:51:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db1201 [18:51:13] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1202 [18:51:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1202 [18:51:26] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db1203 [18:51:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1203 [18:51:41] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [18:51:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:51:59] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:52:08] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:15] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host db1196.mgmt.eqiad.wmnet with reboot policy FORCED [18:53:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1197.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:23] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:29] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:06:11] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:06:29] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:12:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:12:25] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:51] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:16:49] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1196.mgmt.eqiad.wmnet with reboot policy FORCED [19:16:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1197.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:29] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host db1198.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1199.mgmt.eqiad.wmnet with reboot policy FORCED [19:23:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:23:56] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - 
https://phabricator.wikimedia.org/T316899 (10Reedy) [19:24:15] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:28:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:28:31] (03PS1) 10Bking: elastic: prepare to add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) [19:29:53] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [19:30:06] (03PS2) 10Ryan Kemper: elastic: prepare to add new codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [19:30:21] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:30:40] (03CR) 10Dzahn: [C: 04-2] "ah, no, we are still waiting for 3.5, not 3.4.5" [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [19:32:22] (03CR) 10Hashar: "Indeed. 
That is still used for 3.4 and I had delays in testing and planning the 3.5 upgrade :-)" [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [19:38:22] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37094/console" [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [19:41:29] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1198.mgmt.eqiad.wmnet with reboot policy FORCED [19:41:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1199.mgmt.eqiad.wmnet with reboot policy FORCED [19:44:53] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:46:01] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [19:46:12] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host db1200.mgmt.eqiad.wmnet with reboot policy FORCED [19:48:18] (03CR) 10Dzahn: "This is a remnant left from the pre-PHP parsoid service. 
The ticket still open was about removing those last remnants which made me find t" [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [19:48:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1201.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:44] (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/pcc-worker1003/37094/elastic2070.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [19:51:00] (03CR) 10Bking: [C: 03+2] elastic: prepare to add new codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/829052 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [19:51:40] (03CR) 10Dzahn: [C: 03+2] "I got a report from RhinosF1 that we got the alert again. I made for it https://phabricator.wikimedia.org/T316903" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [19:53:56] (03CR) 10Dzahn: [C: 03+2] Revert "c:spamassassin move Spamassassin updates from crontab" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [19:58:53] !log otrs1001 - sudo systemctl reset-failed - T316903 [19:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:59] T316903: vrts - spamassassin icinga alerts - https://phabricator.wikimedia.org/T316903 [19:59:04] (03PS1) 10Bking: elastic: bring new hosts into elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/829056 (https://phabricator.wikimedia.org/T300943) [20:00:04] brennen: OwO what's this, a deployment window?? UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220901T2000). nyaa~ [20:00:04] danisztls, ebernhardson, and cjming: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:13] o/ [20:00:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/829056 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [20:00:37] o/ [20:00:39] o/ [20:01:30] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37095/console" [puppet] - 10https://gerrit.wikimedia.org/r/829056 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [20:02:01] (03CR) 10Dzahn: [C: 03+2] Switch to deployment-urldownloader03 [puppet] - 10https://gerrit.wikimedia.org/r/828790 (https://phabricator.wikimedia.org/T278641) (owner: 10Zabe) [20:02:22] (03PS2) 10Thcipriani: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828614 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:03:07] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+1] elastic: bring new hosts into elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/829056 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [20:03:18] (03CR) 10Bking: [C: 03+2] elastic: bring new hosts into elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/829056 (https://phabricator.wikimedia.org/T300943) (owner: 10Bking) [20:03:22] \o [20:05:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828614 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:06:22] (03Merged) 10jenkins-bot: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828614 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:06:39] getting started using our fancy new "scap backport" 
command [20:06:46] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:828614|Deploy Research Incentive Survey to idwiki (T316466)]] [20:06:51] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:07:03] (03CR) 10Dzahn: [C: 03+1] "I'm sure there are some things to change or improve in this but I think it's a good idea to first start with a simple change like this tha" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (owner: 10AOkoth) [20:07:49] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:07:56] (03CR) 10Dzahn: [C: 03+1] "nitpick: please link to a ticket" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (owner: 10AOkoth) [20:11:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:12:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:02] !log thcipriani@deploy1002 thcipriani and dani: Backport for [[gerrit:828614|Deploy Research Incentive Survey to idwiki (T316466)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:13:06] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:13:31] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1200.mgmt.eqiad.wmnet with reboot policy FORCED [20:13:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1201.mgmt.eqiad.wmnet with reboot policy FORCED [20:13:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [20:13:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:59] danisztls: your change is on mwdebug1002, check please [20:14:26] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host db1202.mgmt.eqiad.wmnet with reboot policy FORCED [20:14:26] (03PS13) 10Thcipriani: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [20:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1203.mgmt.eqiad.wmnet with reboot policy FORCED [20:15:58] thcipriani: tested, some translations are missing, can you revert it? [20:17:26] danisztls: sure, thanks for testing, I can revert [20:17:43] (03PS1) 10TrainBranchBot: Revert "Deploy Research Incentive Survey to idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829061 [20:18:15] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:18:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829061 (owner: 10TrainBranchBot) [20:18:32] thcipriani: thanks [20:18:33] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:19:13] (03Merged) 10jenkins-bot: Revert "Deploy Research Incentive Survey to idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829061 (owner: 10TrainBranchBot) [20:19:28] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:829061|Revert "Deploy Research Incentive Survey to idwiki"]] [20:19:43] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:20:03] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10Andrew) 05Open→03Resolved [20:20:08] !log thcipriani@deploy1002 thcipriani and trainbranchbot: Backport for [[gerrit:829061|Revert "Deploy Research Incentive Survey to idwiki"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:20:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [20:20:51] !log thcipriani@deploy1002 sync-world aborted: Backport for [[gerrit:829061|Revert "Deploy Research Incentive Survey to idwiki"]] (duration: 01m 23s) [20:20:51] !log thcipriani@deploy1002 backport aborted: (duration: 02m 57s) [20:20:51] !log thcipriani@deploy1002 backport aborted: (duration: 03m 09s) [20:21:00] danisztls: should be reverted [20:21:17] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:21:23] thcipriani: that 
TrainBranchBot revert process seems slick, but it is not leaving the common "this commit was reverted" linkage that one gets from using the gerrit UI. Might be a nice thing to try and figure out how to add. [20:22:06] Or at least include the hash of the commit being reverted in the commit message so it's easier to follow in the reverse direction [20:22:15] (PS1) Dduvall: TRY promote script to revert permissions for clean up [puppet] - https://gerrit.wikimedia.org/r/829062 [20:22:17] (PS1) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [20:22:25] hrm, yeah, just a git revert it seems :) [20:22:28] (CR) TrainBranchBot: [C: +2] "Approved via scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson) [20:22:37] bd808: Thanks for the feedback. We can probably do something about that. [20:22:56] (but the command of "scap backport --revert " is magic :D) [20:23:01] (Abandoned) Dduvall: TRY promote script to revert permissions for clean up [puppet] - https://gerrit.wikimedia.org/r/829062 (owner: Dduvall) [20:23:32] I saw how fast it all happened and was amazed. congrats to dancy and anyone else who has built that. :) [20:23:52] (Merged) jenkins-bot: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson) [20:24:06] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:824787|cirrus: Handle transition to elasticsearch 7.10]] [20:24:08] Thank you! Jeena was a major contributor.
[20:24:29] (03PS2) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [20:24:32] !log thcipriani@deploy1002 thcipriani and ebernhardson: Backport for [[gerrit:824787|cirrus: Handle transition to elasticsearch 7.10]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:24:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:25:10] jeena: wiki<3 to you too then. better deploy tools make things better for everyone :) [20:25:13] ^ ebernhardson on mwdebug, check please [20:26:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:19] bd808: Thanks and thanks for the suggestion! :) [20:26:26] thcipriani: looking [20:26:39] thcipriani: which mwdebug? [20:27:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:12] ebernhardson: all of them :) [20:27:21] mwdebug1002 should work [20:27:45] thcipriani: ahh :) Looks to work, as expected nothing changes. The magic happens (and hopefully doesn't break anything) when testwiki rolls forward next week [20:27:49] s/testwiki/group0/ [20:27:55] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:28:41] ebernhardson: gotcha, going live [20:29:05] (03CR) 10Dduvall: phabricator: Deploy user should own everything under old rev directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [20:29:21] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:30:24] PROBLEM - Check systemd state on elastic2080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:36] PROBLEM - Check systemd state on elastic2074 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:00] PROBLEM - Check systemd state on elastic2081 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:06] PROBLEM - Check systemd state on elastic2075 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:16] ^ those are new servers coming into the cluster, nothing important [20:31:35] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Dzahn) [20:31:35] phew [20:31:46] ^ indeed, kicking off another puppet run to clear those elastic alerts [20:31:49] good to know I'm not in the process of breaking something :D [20:31:57] (sorry, should have downtimed those hosts first) 
[20:32:13] yea the timing was interesting :) [20:32:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:30] thcipriani: but imagine if you got to get one of those neat shirts! :P [20:32:50] ryankemper: I have enough shirts to last a lifetime [20:33:10] (one) [20:33:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:33:16] xD [20:33:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:51] SRE, Infrastructure-Foundations, vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (Dzahn) @MoritzMuehlenhoff @Krinkle I made procurement subtasks. There are 2 because the template says it needs to be limited to a specific DC. Please take a look if you... [20:35:13] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: T300943 [20:35:17] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [20:35:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: T300943 [20:37:10] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic20[73-86].* [20:38:10] RECOVERY - Check systemd state on elastic2080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:30] RECOVERY - Check systemd state on elastic2074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:06] RECOVERY - Check systemd state on elastic2081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:10] RECOVERY - Check 
systemd state on elastic2075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:13] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1202.mgmt.eqiad.wmnet with reboot policy FORCED [20:39:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1203.mgmt.eqiad.wmnet with reboot policy FORCED [20:40:15] !log T300943 New hosts are in service and were pooled like so: `sudo confctl select name=elastic20[73-86].* set/weight=10:pooled=yes` (in retrospect that syntax seems to have selected too many hosts, but the final state of pybal is correct per https://config-master.wikimedia.org/pybal/codfw/search) [20:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:20] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [20:41:02] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:824787|cirrus: Handle transition to elasticsearch 7.10]] (duration: 16m 56s) [20:41:13] ^ ebernhardson should be live [20:41:17] (03PS4) 10Clare Ming: Remove Vector grid config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [20:41:30] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:42:32] (03PS3) 10Thcipriani: Disable sticky header edit experiment for idwiki, viwki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [20:43:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 
10Bernard Wang) [20:43:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [20:43:55] (03Merged) 10jenkins-bot: Remove Vector grid config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [20:43:59] (03Merged) 10jenkins-bot: Disable sticky header edit experiment for idwiki, viwki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829043 (https://phabricator.wikimedia.org/T315264) (owner: 10Clare Ming) [20:44:14] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:828616|Remove Vector grid config (T313559)]], [[gerrit:829043|Disable sticky header edit experiment for idwiki, viwki (T315264)]] [20:44:20] T313559: Remove Vector grid feature flagging code - https://phabricator.wikimedia.org/T313559 [20:44:20] T315264: Disable sticky header edit A/B test for idwiki + viwiki - https://phabricator.wikimedia.org/T315264 [20:44:42] !log thcipriani@deploy1002 thcipriani and cjming and bwang: Backport for [[gerrit:828616|Remove Vector grid config (T313559)]], [[gerrit:829043|Disable sticky header edit experiment for idwiki, viwki (T315264)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:47:29] (03CR) 10Dzahn: phabricator: Deploy user should own everything under old rev directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [20:48:14] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:49:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:49:58] !log 
thcipriani@deploy1002 Finished scap: Backport for [[gerrit:828616|Remove Vector grid config (T313559)]], [[gerrit:829043|Disable sticky header edit experiment for idwiki, viwki (T315264)]] (duration: 05m 44s) [20:50:04] T313559: Remove Vector grid feature flagging code - https://phabricator.wikimedia.org/T313559 [20:50:04] T315264: Disable sticky header edit A/B test for idwiki + viwiki - https://phabricator.wikimedia.org/T315264 [20:50:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:51:40] scap backport gets better every time I use it [20:52:02] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:55:26] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:58:08] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:01:54] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:20:12] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occurred trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:20] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:32:40] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:33:08] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:34:26] PROBLEM - Disk space on ms-be2037 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2037&var-datasource=codfw+prometheus/ops [21:36:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:37:12] (CR) Dzahn: "Jelto's suggestion to first depool it is also fine" [puppet] - https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: Dzahn) [21:42:50] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:44:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:05:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:19:59] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:23:51] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:25:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:45:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:59:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:09:17] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:20:33] PROBLEM 
- Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:57] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:24:19] SRE, SRE-Access-Requests, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Reedy) [23:24:53] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:32:09] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:36:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:39:27] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:50:09] PROBLEM - k8s requests count to the API on ml-serve-ctrl1001 is CRITICAL: 102.5 ge 100 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:53:15] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
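Note on the 20:40:15 !log entry: the selector `name=elastic20[73-86].*` probably selected too many hosts because, if the selector is evaluated as a regular expression (an assumption here, not confirmed in the log), `[73-86]` is a character class matching one character (7, anything from 3 to 8, or 6), not the numeric range 73 to 86. A minimal sketch of the difference using Python's `re` module follows. The host list is hypothetical, generated only for illustration, and the `intended` pattern is one possible way to express the range, not necessarily what was meant.

```python
import re

# Hypothetical inventory for illustration only: elastic2025 .. elastic2086.
hosts = [f"elastic{n}.codfw.wmnet" for n in range(2025, 2087)]

# Selector as logged. Inside [...] this is a character class, not a numeric
# range: it matches ONE character that is '7', anything in '3'-'8', or '6'.
logged = re.compile(r"elastic20[73-86].*")

# One way to spell out the numeric range 2073-2086 explicitly (assumption
# about the intent, based on the "14 hosts" downtime entry at 20:35:13).
intended = re.compile(r"elastic20(7[3-9]|8[0-6])\..*")

matched_logged = [h for h in hosts if logged.fullmatch(h)]
matched_intended = [h for h in hosts if intended.fullmatch(h)]

print(len(matched_logged), matched_logged[0], matched_logged[-1])
print(len(matched_intended), matched_intended[0], matched_intended[-1])
```

Against this made-up inventory the logged pattern matches every host whose number starts with 203 through 208, while the explicit alternation matches only the 14 hosts 2073 to 2086.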