[00:40:59] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:45] SRE, SRE-Access-Requests, LDAP-Access-Requests, Product-Analytics, User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (SWakiyama) Approved!
[01:10:53] (PS1) Ladsgroup: admin: Add bwang to deployment group [puppet] - https://gerrit.wikimedia.org/r/759845 (https://phabricator.wikimedia.org/T300664)
[01:13:21] SRE, SRE-Access-Requests, LDAP-Access-Requests, Product-Analytics, User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (Ladsgroup)
[01:14:49] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:18:16] (PS1) Ladsgroup: admin: Adding AUgolnikova [puppet] - https://gerrit.wikimedia.org/r/759846 (https://phabricator.wikimedia.org/T300878)
[01:41:45] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:07] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms
[02:13:09] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:43:35] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:55] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:44:21] (CR) Krinkle: varnish: Replace "Expires" in Set-Cookie with "Max-Age" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: Ladsgroup)
[04:57:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:09:35] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:12:01] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[05:19:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:19:13] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[05:33:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:41:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[05:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:57] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:09:28] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[06:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[06:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:19] (PS1) Andrew Bogott: cloudvirts: remove some buster-only packages from bullseye [puppet] - https://gerrit.wikimedia.org/r/759848
[06:56:13] (CR) Andrew Bogott: [C: +2] cloudvirts: remove some buster-only packages from bullseye [puppet] - https://gerrit.wikimedia.org/r/759848 (owner: Andrew Bogott)
[06:59:44] (PS1) Andrew Bogott: cloudvirt: remove virt-top from bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/759849
[07:00:41] (CR) Andrew Bogott: [C: +2] cloudvirt: remove virt-top from bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/759849 (owner: Andrew Bogott)
[07:42:40] SRE, LDAP-Access-Requests: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (Aklapper) @herron: As I am not a member of #acl_sre-team I cannot access https://phabricator.wikimedia.org/project/edit/956/ - feel free to do so (also for https://phabricator.wikimedia.org/projec...
[07:43:41] SRE, Kubernetes, discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (Aklapper) a:Joe @Joe: Thanks in advance!
[08:17:16] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:38] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:29:21] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (ssr) Who are in favor of DPL, please express support here https://meta.wikimedia.org/wiki/Community_Wishlist_Su...
[09:30:46] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:58:44] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:17:56] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:23:54] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:58:58] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:25:06] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:38:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missi
[15:38:10] [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:29:58] (PS1) Andrew Bogott: cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759893
[16:30:42] (CR) jerkins-bot: [V: -1] cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759893 (owner: Andrew Bogott)
[16:49:03] SRE, Traffic-Icebox: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Mitar) I think the only viable solution here is that instead of increasing limits, is to allow payload in a request body. ElasticsSearch uses body in GET requests, which is not standard and one could...
[16:54:49] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[16:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:33] SRE, Traffic-Icebox: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Mitar) Oh, POST works already instead of GET. This should be documented somewhere in https://www.mediawiki.org/wiki/API:Main_page. Because when I read documentation like https://www.mediawiki.org/wik...
[17:27:40] (PS2) Andrew Bogott: cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759893
[17:35:21] (PS1) MSantos: maps: remove tilerator logic from planet_sync [puppet] - https://gerrit.wikimedia.org/r/759894
[17:35:56] (CR) jerkins-bot: [V: -1] maps: remove tilerator logic from planet_sync [puppet] - https://gerrit.wikimedia.org/r/759894 (owner: MSantos)
[17:40:04] (PS1) Andrew Bogott: cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759895
[17:46:42] (Abandoned) Andrew Bogott: cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759893 (owner: Andrew Bogott)
[17:46:46] (CR) Andrew Bogott: [C: +2] cloudvirt/libvirtd: try to fix service startup order [puppet] - https://gerrit.wikimedia.org/r/759895 (owner: Andrew Bogott)
[17:53:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[18:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:45] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye
[19:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:38] (PS1) Majavah: planet: remove Planet Apache links [puppet] - https://gerrit.wikimedia.org/r/759922
[20:15:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye
[20:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:46] (PS1) Andrew Bogott: cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935
[20:30:34] (CR) jerkins-bot: [V: -1] cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935 (owner: Andrew Bogott)
[20:33:02] (PS2) Andrew Bogott: cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935
[20:33:40] (CR) jerkins-bot: [V: -1] cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935 (owner: Andrew Bogott)
[20:36:12] (PS1) MSantos: tegola: prepare eqiad cluster for production [deployment-charts] - https://gerrit.wikimedia.org/r/759938
[20:36:22] (PS3) Andrew Bogott: cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935
[20:38:38] (CR) Andrew Bogott: [C: +2] cloudvirt/libvirtd: another attempt to fix startup ordering [puppet] - https://gerrit.wikimedia.org/r/759935 (owner: Andrew Bogott)
[20:40:52] (PS1) Andrew Bogott: cloudvirt: fully/quality/path to systemctl [puppet] - https://gerrit.wikimedia.org/r/759941
[20:41:56] (CR) Andrew Bogott: [C: +2] cloudvirt: fully/quality/path to systemctl [puppet] - https://gerrit.wikimedia.org/r/759941 (owner: Andrew Bogott)
[20:51:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[21:00:09] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[21:21:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[21:28:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[21:28:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye
[21:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[21:52:30] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:01:28] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:02:48] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:05:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:10:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye
[22:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:54] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:21:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:26:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:34:51] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:47:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:51:41] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:56:21] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:56:29] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:07:59] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:23] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:17:43] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:22:29] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:24:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:32:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:41:03] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:48:47] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:55:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:59:45] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28