[00:08:29] Tool-wikiqanda, Future-Audiences: [Bug] Investigate issues from internal testing - https://phabricator.wikimedia.org/T380799#10355908 (derenrich) a:derenrich https://gitlab.wikimedia.org/repos/future-audiences/wikichat/-/merge_requests/26 addresses some of it
[02:26:56] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:30:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[02:31:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:35:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[03:55:22] tool-wscontest, good first task: Make all interface messages translatable - https://phabricator.wikimedia.org/T346994#10356159 (Samwilson) Thanks for the PR! > I have a question, for the text Index Pages:, is it okay to do {{ msg('index-pages') }}: [1] as index-pages already exists and is also used som...
[04:58:50] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:03:50] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:04:27] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827 (Andrew) NEW
[05:07:06] cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829 (Andrew) NEW
[05:11:40] cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829#10356239 (Andrew) This same error is present throughout tools: tools-cumin-1.tools.eqiad1.wikimedia.cloud,tools-elastic-[4-6].tools.eqiad1.wikimedia.cloud,tools-harbor-1.tools.eqiad1.wikimedia.cl...
[05:45:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[05:47:28] FIRING: InstanceDown: Project tools instance tools-sgebastion-10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:52:28] RESOLVED: InstanceDown: Project tools instance tools-sgebastion-10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:15:21] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[06:15:51] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[06:45:34] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T380827)
[06:47:48] T380827: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827
[06:49:16] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10356296 (Andrew) I'm rebooting nfs nodes via the cookbook. Multiple people are seeing intermittent dns errors; I'm not sure how they can be related but this seems like a good first step.
[06:55:51] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:05:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:10:21] FIRING: [2x] MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:21:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:28:52] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all NFS workers (T380827)
[07:31:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:36:04] cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10356349 (Andrew) This seems to be resolved now, pending questions are: - why no alerts? - are the docs as wrong as the look to me at 2AM?
[08:24:53] cloud-services-team, Toolforge: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833 (Slst2020) NEW
[08:33:12] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-61
[08:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:34:24] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-61
[08:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:07:28] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17
[09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:08:41] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17
[09:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:11:25] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-50
[09:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:31] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-50
[09:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:48] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-70
[09:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:13:54] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-70
[09:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:14:26] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72
[09:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:15:35] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72
[09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:19:46] PAWS: [bug] - https://phabricator.wikimedia.org/T380834 (Ravi7453) NEW
[09:36:42] cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837 (Count_Count) NEW
[10:17:24] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-8
[10:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:18:07] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-8
[10:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:18:50] cloud-services-team, Toolforge: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844 (Slst2020) NEW
[10:20:21] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[10:22:22] cloud-services-team, Toolforge: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356723 (hashar)
[10:22:24] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-9
[10:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:23:05] cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837#10356711 (Slst2020) This is related to {T380844}
[10:23:08] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-9
[10:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:23:58] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team (Seen): Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830#10356713 (hashar) O...
[10:30:37] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:20] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:58] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:32:31] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:32:32] cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356757 (Peachey88)
[10:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:33:14] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:33:41] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:34:11] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:35:22] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:36:21] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:38:14] FIRING: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:40:59] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:42:23] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-7
[10:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:43:14] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:44:09] cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356793 (hashar)
[11:09:52] (PS1) David Caro: toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844)
[11:12:09] cloud-services-team, Toolforge, Patch-For-Review, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356999 (dcaro) p:Triage→High
[11:18:41] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T380426#10357057 (Curb_Safe_Charmer) Open→Resolved
[11:21:43] (CR) Arturo Borrero Gonzalez: [C:+1] toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[11:23:12] PAWS: [bug] - https://phabricator.wikimedia.org/T380834#10357091 (dcaro) Open→Declined Forgot to fill up I guess
[11:25:13] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad....
[11:29:08] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eq...
[12:03:27] cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357175 (aborrero)
[12:04:29] cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357169 (aborrero) p:Triage→Low >>! In T380832#10356349, @Andrew wrote: > This seems to be resolved now, pending questions are: > > - why no alerts? Were jobs-api pods crashing? I think the monitoring...
[12:15:52] cloud-services-team, Toolforge, Patch-For-Review, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10357206 (aborrero)
[12:15:55] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357207 (aborrero)
[12:23:57] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357238 (aborrero) p:Triage→High Regarding why NFS stopped responding, I did some quick research. I can see some log entries: ` Nov 26 02:27:50 tools-sgebastion-10 kernel: [9562349.633512] n...
[12:24:39] cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357247 (Slst2020) >>! In T380832#10357169, @aborrero wrote: > I think the actual problem was {T380844}. Yes, I can confirm this: ` sed by NameResolutionError( "
!log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:34:16] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:34:29] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:34:31] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:35:00] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:35:03] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:40:58] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:41:03] !log aborrero@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[12:48:09] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357359 (aborrero) A couple of minutes before the nfs server was reported as not responding, the neutron-openvswith-agent running on the cloudvirt hosting the nfs server had a problem: ` Nov 26 02:23...
[12:50:33] cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829#10357365 (aborrero) p:Triage→Medium
[12:51:43] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357364 (dcaro) > So my theory is maybe a ceph network hiccup? There's no traffic interruption, errors spike or drops spike on the (cloudsw) switches, nor flips on the ceph health/degraded objects da...
[12:55:55] cloud-services-team: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357382 (aborrero) error is ` Nov 23 11:28:51 cloudvirt1061 kernel: Memory failure: 0x4fc0380: unhandlable page. `
[13:03:58] (CR) David Caro: [C:+2] toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[13:04:26] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357417 (aborrero) supporting the theory of a some kind of general openstack network problems, openvswitch failed in pretty much all the cloudvirts more or less at the same time: {P71184}
[13:04:44] tool-wscontest, good first task: Add UTC in the WSContest contest page - https://phabricator.wikimedia.org/T331225#10357444 (AS1100K) Open→Resolved
[13:07:37] (Merged) jenkins-bot: toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[13:09:47] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357460 (dcaro) An upstream says its a "harmless log message" https://bugzilla.redhat.com/show_bug.cgi?id=1506035
[13:10:10] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357462 (aborrero) My current theory is that there was rollout of a puppet change, that restarted openvswitch across all hypervisors, causing a brief network outage, that was magnified by NFS: ` Nov...
[13:14:42] cloud-services-team: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357469 (aborrero)
[13:15:13] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357471 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad....
[13:16:34] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357472 (dcaro) This one never got resolved https://bugs.launchpad.net/neutron/+bug/1868098 :/ > My current theory is that there was rollout of a puppet change, that restarted openvswitch across all...
[13:17:12] cloud-services-team (Hardware), DC-Ops, ops-eqiad: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357474 (aborrero) p:Triage→Medium hey @Jhancock.wm @Jclark-ctr Do you know if this is concerning, and if we should be taking proactive acti...
[13:17:37] cloud-services-team (Hardware), DC-Ops, ops-eqiad: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357479 (aborrero) a:Jhancock.wm
[13:21:12] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357487 (aborrero) >>! In T380827#10357472, @dcaro wrote: > >> My current theory is that there was rollout of a puppet change, that restarted openvswitch across all hypervisors, causing a brief netwo...
[13:33:39] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357549 (dcaro) Open→Resolved
[13:33:57] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10357556 (dcaro)
[13:34:43] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10357557 (dcaro) Open→Resolved Last node added
[13:35:45] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357545 (dcaro) Resolved→Open Node up and running
[13:40:31] PAWS: openrefine in PAWS fails silently to upload new WD item - https://phabricator.wikimedia.org/T380737#10357599 (rook) Thank you @Spinster. @So9q are you able to reproduce?
[13:50:50] cloud-services-team, Toolforge (Toolforge iteration 16): [harbor] Do not clean up images currently running in production - https://phabricator.wikimedia.org/T377854#10357670 (Raymond_Ndibe) In progress→Resolved
[13:52:24] cloud-services-team (FY2024/2025-Q1-Q2), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357672 (elukey) Reached out to Joanna to confirm the user to the group, but it LGTM.
[13:52:28] Toolforge (Toolforge iteration 16): [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10357665 (Raymond_Ndibe) In progress→Resolved
[13:53:23] Toolforge (Toolforge iteration 16): lima-kilo installation giving inconsistent result. Sometimes it works, sometimes it doesn't - https://phabricator.wikimedia.org/T375163#10357676 (Raymond_Ndibe) Open→Resolved
[13:55:38] Toolforge (Toolforge iteration 16), Patch-For-Review: [lima-kilo] support caching of container images using a cache disk - https://phabricator.wikimedia.org/T378180#10357667 (Raymond_Ndibe) In progress→Resolved
[14:02:16] cloud-services-team (FY2024/2025-Q1-Q2), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357729 (fnegri) Open→Stalled Joanna is out sick, but I discussed this with her and we have a team-wide meeting...
[14:02:28] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357735 (Jclark-ctr) @aborrero i have updated Idrac firmware. I assume Dell will want me to update bios firmware which will require rebo...
[14:51:44] cloud-services-team, Toolforge: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833#10357916 (Raymond_Ndibe)
[14:59:14] FIRING: Kernel error: Server cloudcephmon1004 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephmon1004 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[14:59:18] cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877 (phaultfinder) NEW
[14:59:19] FIRING: Kernel warning: Server cloudcephmon1004 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephmon1004 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[15:12:22] !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:12:44] !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:13:52] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357951 (Andrew) Were there signs of dns/network failures outside of toolforge/k8s containers? I wasn't able to find any last night when troubleshooting.
[15:13:58] !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145
[15:14:26] !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145
[15:14:29] cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10357953 (dcaro) Current errors: ` root@cloudcephmon1004:~# journalctl -k -p err -- Journal begins at Tue 2024-11-26 12:46:45 UTC, ends at Tue 2024-11-26 15:12:54 UTC. -- Nov 26 13:11:...
[15:15:06] (merge) rook: Add pawsdev to codfw1dev [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145 (https://phabricator.wikimedia.org/T380794)
[15:15:24] !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:15:49] !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:15:56] !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[15:16:31] !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[15:20:04] Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357974 (VRiley-WMF) Has this still been performing as expected? If so, are we able to close it?
[15:20:29] cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10357979 (dcaro) The first is expected, the second seems harmless too: https://hetzbiz.cloud/2024/06/11/those-damn-mpt3sas_cm0-messages/
[15:22:33] PAWS: pawsdev in codfw1dev - https://phabricator.wikimedia.org/T380794#10357980 (rook) Open→Resolved
[15:30:05] Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357993 (dcaro) Looks good on my side 👍
[15:33:30] Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357997 (VRiley-WMF) Open→Resolved
[15:39:10] FIRING: ProjectProxyMainProxyDown: Proxy on proxy-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[15:44:10] RESOLVED: ProjectProxyMainProxyDown: Proxy on proxy-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[15:55:59] cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10358073 (JJMC89)
[15:59:14] cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837#10358071 (JJMC89) →Duplicate dup:T380844
[16:14:17] Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10358138 (cmooney) Resolved→Open >>! In T380503#10357974, @VRiley-WMF wrote: > Has this still...
[16:15:06] cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882 (aborrero) NEW
[16:15:16] cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358185 (aborrero)
[16:15:19] cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358187 (aborrero)
[16:15:25] cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10358186 (aborrero)
[16:15:26] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10358188 (aborrero)
[16:15:40] cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358189 (aborrero) p:Triage→High
[16:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:21:06] cloud-services-team, Cloud-VPS: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358222 (fnegri)
[16:24:56] cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10358238 (aborrero) >>! In T380827#10357951, @Andrew wrote: > Were there signs of dns/network failures outside of toolforge/k8s containers? I wasn't able to find any last night when troubleshooting. W...
[16:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:34:41] cloud-services-team, Cloud-VPS: openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886 (aborrero) NEW
[16:37:20] cloud-services-team, Cloud-VPS: openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886#10358325 (aborrero) p:Triage→Medium
[16:45:18] cloud-services-team, wikitech.wikimedia.org, Data-Persistence, DC-Ops, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358358 (Papaul)
[16:46:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' (T380731)
[16:46:22] T380731: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731
[16:46:32] cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886#10358371 (aborrero)
[16:49:21] cloud-services-team, wikitech.wikimedia.org, Data-Persistence, DC-Ops, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358361 (Papaul) Open→Resolved a:Papaul
[16:50:33] PROBLEM - Host cloudvirt1062 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' (T380731)
[16:51:22] RECOVERY - Host cloudvirt1062 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:51:27] T380731: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731
[16:53:14] FIRING: Kernel error: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[16:53:14] FIRING: Kernel warning: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[16:54:54] cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889 (phaultfinder) NEW
[16:55:12] cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890 (taavi) NEW
[16:57:07] cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): toolforge: introduce additional observability for calico - https://phabricator.wikimedia.org/T380892 (aborrero) NEW
[16:57:28] cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358420 (dcaro) Could it use a service account instead? (would be simpler)
[16:59:10] cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): toolforge: introduce additional observability for calico - https://phabricator.wikimedia.org/T380892#10358441 (aborrero) p:Triage→Medium
[17:00:56] cloud-services-team, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893 (Andrew) NEW
[17:01:00] cloud-services-team, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10358467 (Andrew)
[17:01:01] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10358466 (Andrew)
[17:01:36] cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889#10358473 (aborrero) server was rebooted
[17:01:38] cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358465 (aborrero) I guess this refers to https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation
[17:02:53] cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358491 (taavi) >>! In T380890#10358465, @aborrero wrote: > I guess this refers to https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-imperso...
[17:09:03] cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889#10358532 (fnegri) Open→Resolved a:fnegri The error that triggered this alert is: ` fnegri@cloudvirt1062:~$ sudo journalctl -p err -k Nov 26 16:51:04 cloudvirt1062 kernel:...
[17:11:42] cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10358553 (fnegri) Open→Resolved a:fnegri > The first is expected, the second seems harmless too: Agree, I'm gonna resolve the task.
[17:31:29] cloud-services-team: Kernel error Server cloudcontrol1005 may have kernel errors - https://phabricator.wikimedia.org/T380607#10358653 (fnegri) Open→Resolved a:fnegri The logs have already been rotated by journald so I cannot find the message that triggered this alert, but I think they were I/...
[17:47:50] wikitech.wikimedia.org, Parsoid, SRE: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899 (fnegri) NEW
[17:51:44] cloud-services-team, Cloud-VPS, Documentation: ProjectProxyMainProxyDown should have a runbook page - https://phabricator.wikimedia.org/T361873#10358731 (fnegri)
[18:01:52] wikitech.wikimedia.org, Parsoid, SRE: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899#10358783 (ssastry) This may just be {T356718} which might be resolvable soon.
[18:07:32] PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900 (rook) NEW
[18:10:02] PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900#10358821 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/464
[18:10:15] vivian-rook opened https://github.com/toolforge/paws/pull/464
[18:10:32] cloud-services-team (FY2024/2025-Q1-Q2): prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901 (fnegri) NEW
[18:11:17] cloud-services-team (FY2024/2025-Q1-Q2): prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10358835 (fnegri) p:Triage→High
[18:13:13] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10358837 (taavi)
[18:26:22] cloud-services-team, Toolforge: Increase kurbernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902 (Multichill) NEW
[18:33:42] cloud-services-team, Toolforge (Quota-requests): Increase kurbernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902#10358915 (JJMC89)
[19:21:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:27:11] PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900#10359129 (rook) Getting: ` │ Error: Failed to get existing workspaces: operation error S3: ListObjectsV2, https response error StatusCode: 404, RequestID: tx00000ac44c5af7d0cb11f-0067461c90-c898dec-default, HostID: c898de...
[19:31:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:53:14] FIRING: Kernel error: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[20:53:14] FIRING: Kernel warning: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[20:53:33] cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380923 (phaultfinder) NEW