[00:08:29] <wikibugs>	 Tool-wikiqanda, Future-Audiences: [Bug] Investigate issues from internal testing - https://phabricator.wikimedia.org/T380799#10355908 (derenrich) a:derenrich https://gitlab.wikimedia.org/repos/future-audiences/wikichat/-/merge_requests/26 addresses some of it
[02:26:56] <wmcs-alerts>	 FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:30:21] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[02:31:56] <wmcs-alerts>	 RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:35:28] <wmcs-alerts>	 FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[03:55:22] <wikibugs>	 tool-wscontest, good first task: Make all interface messages translatable - https://phabricator.wikimedia.org/T346994#10356159 (Samwilson) Thanks for the PR!  >  I have a question, for the text Index Pages:, is it okay to do {{ msg('index-pages') }}: [1] as index-pages already exists and is also used som...
[04:58:50] <wmcs-alerts>	 FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:03:50] <wmcs-alerts>	 RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:04:27] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827 (Andrew) NEW
[05:07:06] <wikibugs>	 cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829 (Andrew) NEW
[05:11:40] <wikibugs>	 cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829#10356239 (Andrew) This same error is present throughout tools: tools-cumin-1.tools.eqiad1.wikimedia.cloud,tools-elastic-[4-6].tools.eqiad1.wikimedia.cloud,tools-harbor-1.tools.eqiad1.wikimedia.cl...
[05:45:28] <wmcs-alerts>	 RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[05:47:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-sgebastion-10 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:52:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-sgebastion-10 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:15:21] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[06:15:51] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[06:45:34] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T380827)
[06:47:48] <stashbot>	 T380827: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827
[06:49:16] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10356296 (Andrew) I'm rebooting nfs nodes via the cookbook. Multiple people are seeing intermittent dns errors; I'm not sure how they can be related but this seems like a good first step.
[06:55:51] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:05:21] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:10:21] <wmcs-alerts>	 FIRING: [2x] MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[07:21:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:28:52] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all NFS workers (T380827)
[07:31:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:36:04] <wikibugs>	 cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10356349 (Andrew) This seems to be resolved now, pending questions are:  - why no alerts? - are the docs as wrong as they look to me at 2AM?
[08:24:53] <wikibugs>	 cloud-services-team, Toolforge: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833 (Slst2020) NEW
[08:33:12] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-61
[08:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:34:24] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-61
[08:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:07:28] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17
[09:08:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:08:41] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17
[09:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:11:25] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-50
[09:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:31] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-50
[09:12:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:48] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-70
[09:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:13:54] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-70
[09:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:14:26] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72
[09:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:15:35] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72
[09:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:19:46] <wikibugs>	 PAWS: [bug] <your request here> - https://phabricator.wikimedia.org/T380834 (Ravi7453) NEW
[09:36:42] <wikibugs>	 cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837 (Count_Count) NEW
[10:17:24] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-8
[10:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:18:07] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-8
[10:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:18:50] <wikibugs>	 cloud-services-team, Toolforge: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844 (Slst2020) NEW
[10:20:21] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[10:22:22] <wikibugs>	 cloud-services-team, Toolforge: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356723 (hashar)
[10:22:24] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-9
[10:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:23:05] <wikibugs>	 cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837#10356711 (Slst2020) This is related to {T380844}
[10:23:08] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-9
[10:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:23:58] <wikibugs>	 cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team (Seen): Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830#10356713 (hashar) O...
[10:30:37] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:20] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:58] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:32:31] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:32:32] <wikibugs>	 cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356757 (Peachey88)
[10:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:33:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:33:41] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:34:11] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:35:22] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:36:21] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-control-7
[10:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:38:14] <wmcs-alerts>	 FIRING: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:40:59] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[10:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:42:23] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-7
[10:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:43:14] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-control-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[10:44:09] <wikibugs>	 cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356793 (hashar)
[11:09:52] <wikibugs>	 (PS1) David Caro: toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844)
[11:12:09] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10356999 (dcaro) p:Triage→High
[11:18:41] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T380426#10357057 (Curb_Safe_Charmer) Open→Resolved
[11:21:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[11:23:12] <wikibugs>	 PAWS: [bug] <your request here> - https://phabricator.wikimedia.org/T380834#10357091 (dcaro) Open→Declined Forgot to fill it in, I guess
[11:25:13] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad....
[11:29:08] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eq...
[12:03:27] <wikibugs>	 cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357175 (aborrero)
[12:04:29] <wikibugs>	 cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357169 (aborrero) p:Triage→Low >>! In T380832#10356349, @Andrew wrote: > This seems to be resolved now, pending questions are: >  > - why no alerts?  Were jobs-api pods crashing? I think the monitoring...
[12:15:52] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10357206 (aborrero)
[12:15:55] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357207 (aborrero)
[12:23:57] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357238 (aborrero) p:Triage→High Regarding why NFS stopped responding, I did some quick research.  I can see some log entries:  ` Nov 26 02:27:50 tools-sgebastion-10 kernel: [9562349.633512] n...
[12:24:39] <wikibugs>	 cloud-services-team, Toolforge: jobs-api crashing - https://phabricator.wikimedia.org/T380832#10357247 (Slst2020) >>! In T380832#10357169, @aborrero wrote: > I think the actual problem was {T380844}.  Yes, I can confirm this: ` sed by NameResolutionError(     "<urllib3.connection.HTTPSConnection object a...
[12:34:13] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:34:16] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:34:29] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:34:31] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:35:00] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:35:03] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[12:40:58] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[12:41:03] <logmsgbot_cloud>	 !log aborrero@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[12:48:09] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357359 (aborrero) A couple of minutes before the nfs server was reported as not responding, the neutron-openvswitch-agent running on the cloudvirt hosting the nfs server had a problem:  ` Nov 26 02:23...
[12:50:33] <wikibugs>	 cloud-services-team, Toolforge: letsencrypt issues on tools-nfs-2 - https://phabricator.wikimedia.org/T380829#10357365 (aborrero) p:Triage→Medium
[12:51:43] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357364 (dcaro) > So my theory is maybe a ceph network hiccup?  There's no traffic interruption, error spikes, or drop spikes on the (cloudsw) switches, nor flips on the ceph health/degraded objects da...
[12:55:55] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357382 (aborrero) error is  ` Nov 23 11:28:51 cloudvirt1061 kernel: Memory failure: 0x4fc0380: unhandlable page. `
[13:03:58] <wikibugs>	 (CR) David Caro: [C:+2] toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[13:04:26] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357417 (aborrero) Supporting the theory of some kind of general openstack network problem, openvswitch failed on pretty much all the cloudvirts at more or less the same time:  {P71184}
[13:04:44] <wikibugs>	 tool-wscontest, good first task: Add UTC in the WSContest contest page - https://phabricator.wikimedia.org/T331225#10357444 (AS1100K) Open→Resolved
[13:07:37] <wikibugs>	 (Merged) jenkins-bot: toolforge.k8s.reboot: swap the control node if it's the one to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1097991 (https://phabricator.wikimedia.org/T380844) (owner: David Caro)
[13:09:47] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357460 (dcaro) Upstream says it's a "harmless log message" https://bugzilla.redhat.com/show_bug.cgi?id=1506035
[13:10:10] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357462 (aborrero) My current theory is that there was a rollout of a puppet change that restarted openvswitch across all hypervisors, causing a brief network outage that was magnified by NFS:  ` Nov...
[13:14:42] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357469 (aborrero)
[13:15:13] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357471 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad....
[13:16:34] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357472 (dcaro) This one never got resolved https://bugs.launchpad.net/neutron/+bug/1868098 :/  > My current theory is that there was a rollout of a puppet change that restarted openvswitch across all...
[13:17:12] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357474 (aborrero) p:Triage→Medium hey @Jhancock.wm @Jclark-ctr Do you know if this is concerning, and if we should be taking proactive acti...
[13:17:37] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357479 (aborrero) a:Jhancock.wm
[13:21:12] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357487 (aborrero) >>! In T380827#10357472, @dcaro wrote: >  >> My current theory is that there was a rollout of a puppet change that restarted openvswitch across all hypervisors, causing a brief netwo...
[13:33:39] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357549 (dcaro) Open→Resolved
[13:33:57] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10357556 (dcaro)
[13:34:43] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10357557 (dcaro) Open→Resolved Last node added
[13:35:45] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357545 (dcaro) Resolved→Open Node up and running
[13:40:31] <wikibugs>	 PAWS: openrefine in PAWS fails silently to upload new WD item - https://phabricator.wikimedia.org/T380737#10357599 (rook) Thank you @Spinster.  @So9q  are you able to reproduce?
[13:50:50] <wikibugs>	 cloud-services-team, Toolforge (Toolforge iteration 16): [harbor] Do not clean up images currently running in production - https://phabricator.wikimedia.org/T377854#10357670 (Raymond_Ndibe) In progress→Resolved
[13:52:24] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357672 (elukey) Reached out to Joanna to confirm the user to the group, but it LGTM.
[13:52:28] <wikibugs>	 Toolforge (Toolforge iteration 16): [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10357665 (Raymond_Ndibe) In progress→Resolved
[13:53:23] <wikibugs>	 Toolforge (Toolforge iteration 16): lima-kilo installation giving inconsistent result. Sometimes it works, sometimes it doesn't - https://phabricator.wikimedia.org/T375163#10357676 (Raymond_Ndibe) Open→Resolved
[13:55:38] <wikibugs>	 Toolforge (Toolforge iteration 16), Patch-For-Review: [lima-kilo] support caching of container images using a cache disk - https://phabricator.wikimedia.org/T378180#10357667 (Raymond_Ndibe) In progress→Resolved
[14:02:16] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357729 (fnegri) Open→Stalled Joanna is out sick, but I discussed this with her and we have a team-wide meeting...
[14:02:28] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357735 (Jclark-ctr) @aborrero I have updated the iDRAC firmware. I assume Dell will want me to update the BIOS firmware, which will require rebo...
[14:51:44] <wikibugs>	 cloud-services-team, Toolforge: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833#10357916 (Raymond_Ndibe)
[14:59:14] <jinxer-wm>	 FIRING: Kernel error: Server cloudcephmon1004 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephmon1004 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[14:59:18] <wikibugs>	 cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877 (phaultfinder) NEW
[14:59:19] <jinxer-wm>	 FIRING: Kernel warning: Server cloudcephmon1004 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephmon1004 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[15:12:22] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:12:44] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:13:52] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10357951 (Andrew) Were there signs of dns/network failures outside of toolforge/k8s containers? I wasn't able to find any last night when troubleshooting.
[15:13:58] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145
[15:14:26] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145
[15:14:29] <wikibugs>	 cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10357953 (dcaro) Current errors: ` root@cloudcephmon1004:~# journalctl -k -p err -- Journal begins at Tue 2024-11-26 12:46:45 UTC, ends at Tue 2024-11-26 15:12:54 UTC. -- Nov 26 13:11:...
[15:15:06] <wikibugs>	 (merge) rook: Add pawsdev to codfw1dev [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/145 (https://phabricator.wikimedia.org/T380794)
[15:15:24] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:15:49] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:15:56] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[15:16:31] <logmsgbot_cloud>	 !log rook@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[15:20:04] <wikibugs>	 Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357974 (VRiley-WMF) Has this still been performing as expected? If so, are we able to close it?
[15:20:29] <wikibugs>	 cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10357979 (dcaro) The first is expected, the second seems harmless too: https://hetzbiz.cloud/2024/06/11/those-damn-mpt3sas_cm0-messages/
[15:22:33] <wikibugs>	 PAWS: pawsdev in codfw1dev - https://phabricator.wikimedia.org/T380794#10357980 (rook) Open→Resolved
[15:30:05] <wikibugs>	 Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357993 (dcaro) Looks good on my side 👍
[15:33:30] <wikibugs>	 Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357997 (VRiley-WMF) Open→Resolved
[15:39:10] <wmcs-alerts>	 FIRING: ProjectProxyMainProxyDown: Proxy on proxy-04 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[15:44:10] <wmcs-alerts>	 RESOLVED: ProjectProxyMainProxyDown: Proxy on proxy-04 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[15:55:59] <wikibugs>	 cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10358073 (JJMC89)
[15:59:14] <wikibugs>	 cloud-services-team, Toolforge, Kubernetes: DNS errors on toolforge kubernetes - https://phabricator.wikimedia.org/T380837#10358071 (JJMC89) →Duplicate dup:T380844
[16:14:17] <wikibugs>	 Cloud-Services, DC-Ops, Infrastructure-Foundations, netops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10358138 (cmooney) Resolved→Open >>! In T380503#10357974, @VRiley-WMF wrote: > Has this still...
[16:15:06] <wikibugs>	 cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882 (aborrero) NEW
[16:15:16] <wikibugs>	 cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358185 (aborrero)
[16:15:19] <wikibugs>	 cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358187 (aborrero)
[16:15:25] <wikibugs>	 cloud-services-team, Toolforge, Wikimedia-Incident: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844#10358186 (aborrero)
[16:15:26] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10358188 (aborrero)
[16:15:40] <wikibugs>	 cloud-services-team: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358189 (aborrero) p:Triage→High
[16:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:21:06] <wikibugs>	 cloud-services-team, Cloud-VPS: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882#10358222 (fnegri)
[16:24:56] <wikibugs>	 cloud-services-team, Toolforge: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827#10358238 (aborrero) >>! In T380827#10357951, @Andrew wrote: > Were there signs of dns/network failures outside of toolforge/k8s containers? I wasn't able to find any last night when troubleshooting.  W...
[16:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:34:41] <wikibugs>	 cloud-services-team, Cloud-VPS: openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886 (aborrero) NEW
[16:37:20] <wikibugs>	 cloud-services-team, Cloud-VPS: openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886#10358325 (aborrero) p:Triage→Medium
[16:45:18] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Data-Persistence, DC-Ops, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358358 (Papaul)
[16:46:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' (T380731)
[16:46:22] <stashbot>	 T380731: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731
[16:46:32] <wikibugs>	 cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886#10358371 (aborrero)
[16:49:21] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Data-Persistence, DC-Ops, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358361 (Papaul) Open→Resolved a:Papaul
[16:50:33] <icinga-wm>	 PROBLEM - Host cloudvirt1062 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:20] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' (T380731)
[16:51:22] <icinga-wm>	 RECOVERY - Host cloudvirt1062 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:51:27] <stashbot>	 T380731: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731
[16:53:14] <jinxer-wm>	 FIRING: Kernel error: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[16:53:14] <jinxer-wm>	 FIRING: Kernel warning: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[16:54:54] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889 (phaultfinder) NEW
[16:55:12] <wikibugs>	 cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890 (taavi) NEW
[16:57:07] <wikibugs>	 cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): toolforge: introduce additional observability for calico - https://phabricator.wikimedia.org/T380892 (aborrero) NEW
[16:57:28] <wikibugs>	 cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358420 (dcaro) Could it use a service account instead? (would be simpler)
[16:59:10] <wikibugs>	 cloud-services-team, Cloud-VPS, Sustainability (Incident Followup): toolforge: introduce additional observability for calico - https://phabricator.wikimedia.org/T380892#10358441 (aborrero) p:Triage→Medium
[17:00:56] <wikibugs>	 cloud-services-team, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893 (Andrew) NEW
[17:01:00] <wikibugs>	 cloud-services-team, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10358467 (Andrew)
[17:01:01] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10358466 (Andrew)
[17:01:36] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889#10358473 (aborrero) server was rebooted
[17:01:38] <wikibugs>	 cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358465 (aborrero) I guess this refers to https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation
[17:02:53] <wikibugs>	 cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10358491 (taavi) >>! In T380890#10358465, @aborrero wrote: > I guess this refers to https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-imperso...
[17:09:03] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380889#10358532 (fnegri) Open→Resolved a:fnegri The error that triggered this alert is:  ` fnegri@cloudvirt1062:~$ sudo journalctl -p err -k Nov 26 16:51:04 cloudvirt1062 kernel:...
[17:11:42] <wikibugs>	 cloud-services-team: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877#10358553 (fnegri) Open→Resolved a:fnegri > The first is expected, the second seems harmless too:  Agree, I'm gonna resolve the task.
[17:31:29] <wikibugs>	 cloud-services-team: Kernel error Server cloudcontrol1005 may have kernel errors - https://phabricator.wikimedia.org/T380607#10358653 (fnegri) Open→Resolved a:fnegri The logs have already been rotated by journald so I cannot find the message that triggered this alert, but I think they were I/...
[17:47:50] <wikibugs>	 wikitech.wikimedia.org, Parsoid, SRE: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899 (fnegri) NEW
[17:51:44] <wikibugs>	 cloud-services-team, Cloud-VPS, Documentation: ProjectProxyMainProxyDown should have a runbook page - https://phabricator.wikimedia.org/T361873#10358731 (fnegri)
[18:01:52] <wikibugs>	 wikitech.wikimedia.org, Parsoid, SRE: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899#10358783 (ssastry) This may just be {T356718} which might be resolvable soon.
[18:07:32] <wikibugs>	 PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900 (rook) NEW
[18:10:02] <wikibugs>	 PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900#10358821 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/464
[18:10:15] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/paws/pull/464
[18:10:32] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2): prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901 (fnegri) NEW
[18:11:17] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2): prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10358835 (fnegri) p:Triage→High
[18:13:13] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10358837 (taavi)
[18:26:22] <wikibugs>	 cloud-services-team, Toolforge: Increase kurbernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902 (Multichill) NEW
[18:33:42] <wikibugs>	 cloud-services-team, Toolforge (Quota-requests): Increase kurbernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902#10358915 (JJMC89)
[19:21:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:27:11] <wikibugs>	 PAWS: update application cred for codfw1dev - https://phabricator.wikimedia.org/T380900#10359129 (rook) Getting: ` │ Error: Failed to get existing workspaces: operation error S3: ListObjectsV2, https response error StatusCode: 404, RequestID: tx00000ac44c5af7d0cb11f-0067461c90-c898dec-default, HostID: c898de...
[19:31:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:53:14] <jinxer-wm>	 FIRING: Kernel error: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+error
[20:53:14] <jinxer-wm>	 FIRING: Kernel warning: Server cloudvirt1062 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DKernel+warning
[20:53:33] <wikibugs>	 cloud-services-team: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380923 (phaultfinder) NEW