[00:16:29] (open) bd808: App.vue: Allow setting savePath externally [toolforge-repos/sitesampler] - https://gitlab.wikimedia.org/toolforge-repos/sitesampler/-/merge_requests/11
[00:17:33] (update) bd808: App.vue: Allow setting savePath externally [toolforge-repos/sitesampler] - https://gitlab.wikimedia.org/toolforge-repos/sitesampler/-/merge_requests/11
[00:18:09] (merge) bd808: App.vue: Allow setting savePath externally [toolforge-repos/sitesampler] - https://gitlab.wikimedia.org/toolforge-repos/sitesampler/-/merge_requests/11
[00:34:41] (open) bd808: Consolidate side loading config [toolforge-repos/sitesampler] - https://gitlab.wikimedia.org/toolforge-repos/sitesampler/-/merge_requests/12
[00:35:19] (merge) bd808: Consolidate side loading config [toolforge-repos/sitesampler] - https://gitlab.wikimedia.org/toolforge-repos/sitesampler/-/merge_requests/12
[01:09:27] Tool-sitesampler: Userscript integration's save feature is broken - https://phabricator.wikimedia.org/T385246 (bd808) NEW
[01:11:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[01:16:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[01:16:37] Tool-sitesampler: Simplify `modify_html` by using a `` tag - https://phabricator.wikimedia.org/T385247 (bd808) NEW
[01:23:28] FIRING: InstanceDown: Project tools instance tools-redis-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:27:09] Tool-sitesampler: Add an "off" switch of some sort for the eventlisteners - https://phabricator.wikimedia.org/T385248 (bd808) NEW
[01:37:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[01:41:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[01:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:06:36] cloud-services-team, Data-Services, DBA: Prepare and check storage layer for kncwiki - https://phabricator.wikimedia.org/T385182#10511068 (Marostegui) p:Triage→Medium Sanitarium hosts cleaned. `kncwiki_p` database created Grants added for `lasbdbuser` Triggers confirmed working with my user....
[07:14:22] cloud-services-team, Data-Services, DBA: Prepare and check storage layer for kncwiki - https://phabricator.wikimedia.org/T385182#10511117 (JJMC89) >>! In T385182#10511068, @Marostegui wrote: > This is ready for views creation. {T385188}
[07:24:33] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services: [wikireplicas] Create views for new wiki kncwiki - https://phabricator.wikimedia.org/T385188#10511131 (Marostegui) This is ready for views creation.
[07:26:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[07:44:14] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10511136 (VladimirAlexiev) I have a wikimedia account `Vladimir_Alexiev` (that I use for wikidata and across wikipedias) but not a wikitech account. I wanted to post to the @Ladsgroup talk...
[08:33:10] Tool-ranker, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Jan), Patch-For-Review, Unplanned-Sprint-Work: Add Ranker to translatewiki.net - https://phabricator.wikimedia.org/T384061#10511192 (Wangombe) We've had trouble with exports and imports across translatewiki.net. I don't thi...
[09:29:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[09:34:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[09:53:08] cloud-services-team, Data-Services, DBA: Prepare and check storage layer for kncwiki - https://phabricator.wikimedia.org/T385182#10511300 (fnegri) Open→Resolved a:fnegri I think this task can be resolved, as Maintenance_bot now creates a dedicated task for view creation (linked in the...
[09:58:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[09:58:55] cloud-services-team, Data-Services, DBA: Prepare and check storage layer for kncwiki - https://phabricator.wikimedia.org/T385182#10511315 (Marostegui) Thanks!
[10:03:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[10:04:52] cloud-services-team, Toolforge: [toolforge] [redis] Prometheus exporter logging errors - https://phabricator.wikimedia.org/T366471#10511333 (aborrero) I saw the errors today again: ` Jan 31 09:23:16 tools-redis-7 prometheus-redis-exporter[194603]: time="2025-01-31T09:23:16Z" level=error msg="Couldn't se...
[10:11:47] cloud-services-team: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262 (aborrero) NEW
[10:11:59] cloud-services-team: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262#10511353 (aborrero) p:Triage→Medium
[10:15:57] cloud-services-team, Toolforge: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262#10511357 (taavi)
[10:24:41] cloud-services-team, Toolforge: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262#10511381 (fnegri) There is some inconsistency in the metrics data: depending on the time window I select in Grafana, I get different values for the same tim...
[10:25:55] (close) salelya: init [toolforge-repos/multilingual-missing-articles] - https://gitlab.wikimedia.org/toolforge-repos/multilingual-missing-articles/-/merge_requests/1
[10:29:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[10:34:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[10:42:16] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#10511416 (taavi) Today we're up to 48484. So the growth has slowed a bit to about 6.7 accounts per day, which would mean we have until mid-Sep...
[10:43:52] cloud-services-team, Toolforge: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262#10511420 (fnegri) The `MaintainKubeusersDown` is NOT firing if I look at https://prometheus.svc.toolforge.org/tools/alerts?search=maintain but it IS firing...
[10:47:13] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/2
[10:58:09] cloud-services-team, Toolforge: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262#10511440 (fnegri) Yes the 2 servers are not in sync: ` root@tools-prometheus-6:~# promtool query instant http://localhost:9902/tools 'up{job="k8s-maintain-...
[11:01:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:06:53] cloud-services-team, Toolforge, Lingua-Libre: Assess opportunity to migrate from WMFR-OVH server to WMF Toolforge or WMF Cloud VPS - https://phabricator.wikimedia.org/T385064#10511450 (Yug) a:Yug
[11:07:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:13:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[11:18:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[11:20:54] cloud-services-team, Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10511482 (fnegri)
[11:23:37] cloud-services-team, Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10511486 (fnegri) The restart fixed the `k8s-maintain-kubeusers` metric that is now reporting the same value in both servers: ` fnegri@tools-prom...
[11:34:52] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264 (Andrew) NEW
[11:34:58] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10511551 (Andrew) ` [None req-e95a879c-553e-4469-a2b3-ec2d3c0168c0 novaadmin admin - - default default] [instance: b54ae99c-fce8-4f0a-bb32-33d5a45e44b2] Live Migration failure: internal...
[11:40:25] cloud-services-team, Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10511557 (fnegri) Even after a VM reboot, metricsinfra-prometheus-3 is still showing the wrong value: ` root@metricsinfra-prometheus-3:~# promtoo...
[11:42:17] cloud-services-team, Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10511569 (fnegri) Open→In progress a:fnegri
[11:44:18] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10511576 (Andrew) indeed, when the VM starts up on the new host it is trying to contact cloudcephmons that don't exist anymore. So this is a related issue to T383583
[12:22:48] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10511743 (Reedy) ` reedy@deploy2002:~$ mwscript extensions/CentralAuth/maintenance/createLocalAccount.php --wiki=labswiki Vladimir_Alexiev DEPRECATION WARNING: Maintenance scripts are movin...
[12:37:48] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:42:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:46:48] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:47:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[12:47:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:47:48] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:49:48] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:52:12] FIRING: [2x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[12:53:14] Tool-ranker, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Jan), Patch-For-Review, Unplanned-Sprint-Work: Add Ranker to translatewiki.net - https://phabricator.wikimedia.org/T384061#10511869 (abi_) Exports happened today: https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_...
[12:54:31] RESOLVED: [2x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[12:54:48] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:57:07] FIRING: [4x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[12:57:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[13:02:07] RESOLVED: [4x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:04:31] FIRING: [6x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:06:22] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, ops-eqiad, SRE: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10511956 (Papaul) a:VRiley-WMF→Andrew
[13:09:31] RESOLVED: [6x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:12:07] FIRING: [8x] KernelErrors: Server cloudvirt1031 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:16:50] (PS1) Muehlenhoff: Add stub secrets for bookworm replica role [labs/private] - https://gerrit.wikimedia.org/r/1115846 (https://phabricator.wikimedia.org/T381565)
[13:17:07] RESOLVED: [6x] KernelErrors: Server cloudvirt1032 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:18:55] (update) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/2 (owner: l10n-bot)
[13:19:31] FIRING: [8x] KernelErrors: Server cloudvirt1032 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:19:57] (CR) Muehlenhoff: [V:+2 C:+2] Add stub secrets for bookworm replica role [labs/private] - https://gerrit.wikimedia.org/r/1115846 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:20:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[13:22:07] RESOLVED: [4x] KernelErrors: Server cloudvirt1033 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:22:31] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/2 (owner: l10n-bot)
[13:22:34] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/2 (owner: l10n-bot)
[13:47:41] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[13:49:06] Tool-ranker, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Jan), Patch-For-Review, Unplanned-Sprint-Work: Add Ranker to translatewiki.net - https://phabricator.wikimedia.org/T384061#10512190 (LucasWerkmeister) I also just realized I completely forgot to implement a language prefere...
[14:01:58] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10512232 (Andrew) Need to figure out which (if any) of the following will resolve the issue: - reboot (from w/in the VM) - hard reboot from horizon - cold migrate And then sort out VM...
[14:05:58] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10512256 (Andrew) ` andrew@cloudcumin1001:~$ sudo cumin --force "cloudvirt1*" 'ps -ef | grep 10.64.20.68' ` shows the scope of the issue. Need to add some sed to that to extract a li...
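(editor's note) The 14:05:58 comment above asks for a way to reduce the `ps -ef | grep 10.64.20.68` output to a list of VMs. A minimal sketch of that extraction in Python follows. It is an illustration only and not the command that was run on the hosts: the function name `extract_uuids` is hypothetical, and it assumes each matching process line carries a `-uuid <id>` argument.

```python
def extract_uuids(ps_output: str, needle: str = "10.64.20.68") -> list[str]:
    """Return the value following the last "-uuid " on each line containing `needle`.

    Hypothetical helper; assumes the process command line has "-uuid <id>".
    """
    uuids = []
    for line in ps_output.splitlines():
        # Keep only lines that mention the address and carry a -uuid argument.
        # Lines without "-uuid " (for example the grep process itself) are skipped.
        if needle not in line or "-uuid " not in line:
            continue
        # Drop everything up to and including the last "-uuid ".
        tail = line.rsplit("-uuid ", 1)[1]
        # Keep only the first space-delimited token, which is the UUID itself.
        uuids.append(tail.split(" ", 1)[0])
    return uuids
```

Feeding it the captured `ps -ef` text from one host would give one UUID per affected process; how that text is collected from each host is left out of the sketch.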
[14:10:38] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10512266 (Andrew) andrew@cloudcumin1001:~$ sudo cumin --force "cloudvirt1*" "ps -ef | grep 10.64.20.68 | grep -v grep | sed 's/^.*-uuid //' | sed 's/ .*//g' "
[14:16:11] cloud-services-team, Cloud-VPS: VM live migration failing for many/most VMs - https://phabricator.wikimedia.org/T385264#10512270 (Andrew) reboot from within the VM does not work, hard reboot seems to work (for a single test)
[14:34:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[14:39:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[14:41:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[14:44:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[14:50:25] cloud-services-team, Cloud-VPS: Changing the IPs of cloudcephmons should not require VM reboots - https://phabricator.wikimedia.org/T385288 (fnegri) NEW
[14:50:38] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10512361 (fnegri) > This leaves the followup of understanding how to prevent this the next time we get new cloudcephmons. I created {T385288}.
[14:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:54:19] cloud-services-team, Cloud-VPS: Changing the IPs of cloudcephmons should not require VM reboots - https://phabricator.wikimedia.org/T385288#10512368 (aborrero) What about using both service IPs and service FQDNs? this way the clients may not notice if we assign an IP to a different host. They may notice...
[15:00:22] cloud-services-team, Cloud-VPS: Changing the IPs of cloudcephmons should not require VM reboots - https://phabricator.wikimedia.org/T385288#10512381 (Andrew) Yes, I think service IPs/fqdns is the solution to this. This discussion post implies that a reboot is necessary for any actual change: https://www...
[15:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:03:06] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, ops-eqiad, SRE: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10512386 (cmooney) >>! In T382412#10499217, @VRiley-WMF wrote: > The servers were getting the IP address from p...
[15:07:38] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, ops-eqiad, SRE: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10512402 (cmooney) @Andrew @aborrero remember we have these static routes on the cloudsw pointing to the cloudg...
[15:23:29] PROBLEM - mysqld processes on clouddb1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:24:29] RECOVERY - mysqld processes on clouddb1013 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:27:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[15:37:42] cloud-services-team, Toolforge, Lingua-Libre: Assess opportunity to migrate from WMFR-OVH server to WMF Toolforge or WMF Cloud VPS - https://phabricator.wikimedia.org/T385064#10512479 (Yug)
[15:38:49] cloud-services-team, Toolforge, Lingua-Libre: Assess opportunity to migrate from WMFR-OVH server to WMF Toolforge or WMF Cloud VPS - https://phabricator.wikimedia.org/T385064#10512482 (Yug)
[15:42:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[15:47:26] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[16:17:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[16:38:16] cloud-services-team, Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10512596 (fnegri) Apparently it's a networking issues, as I cannot ping `tools-redis-7` from `metricsinfra-prometheus-3`: ` fnegri@metricsinfra-p...
[16:43:51] cloud-services-team, Toolforge: DNS resolver not working on Toolforge when loading PHP script via browser - https://phabricator.wikimedia.org/T385291 (Albertoleoncio) NEW
[16:43:52] Tool-sitesampler: Simplify `modify_html` by using a `` tag - https://phabricator.wikimedia.org/T385247#10512643 (bd808) Open→In progress a:bd808
[16:49:02] cloud-services-team, Toolforge: DNS resolver not working on Toolforge when loading PHP script via browser - https://phabricator.wikimedia.org/T385291#10512664 (bd808) Quite likely just another case of {T374830}.
[16:51:38] cloud-services-team, Toolforge: DNS resolver not working on Toolforge when loading PHP script via browser - https://phabricator.wikimedia.org/T385291#10512674 (aborrero) is the error happening every time you visit that URL endpoint? https://alberobot.toolforge.org/blockrollback.php in other words, is thi...
[16:56:06] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:01:06] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:42:58] (open) reedy: channels.yaml: Add another AWB tag [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51
[18:20:50] (update) reedy: channels.yaml: Add another AWB tag [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51
[18:21:39] (update) reedy: channels.yaml: Widen AutoWikiBrowser regex [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51
[18:21:43] (update) reedy: channels.yaml: Widen AutoWikiBrowser regex [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51
[18:25:52] Wikibugs: CI seems to be broken - https://phabricator.wikimedia.org/T385302 (Reedy) NEW
[18:27:25] Wikibugs: CI seems to be broken - https://phabricator.wikimedia.org/T385302#10512999 (Reedy) {rTWBTc86e59c518d44c9c6b1ee60f26ae4098c5dea817}...
[18:27:52] Wikibugs: CI seems to be broken - https://phabricator.wikimedia.org/T385302#10513001 (Reedy) Ah, https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/50 [18:28:03] Wikibugs: CI seems to be broken - https://phabricator.wikimedia.org/T385302#10513004 (Reedy) [18:30:00] Wikibugs: Upgrade quart - https://phabricator.wikimedia.org/T385303 (Reedy) NEW [18:30:24] (merge) reedy: Pin Flask to 3.0.3 due to incompat with Quart 0.19.4 [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/50 (owner: hashar) [18:31:59] (update) reedy: channels.yaml: Widen AutoWikiBrowser regex [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51 [18:44:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [18:49:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [19:04:47] (open) reedy: test_phorge.py: Update Notifications name in test_get_tags_with_milestones [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/52 [19:08:22] (merge) reedy: test_phorge.py: Update Notifications name in test_get_tags_with_milestones [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/52 [19:08:33] (update) reedy: channels.yaml: Widen AutoWikiBrowser regex [toolforge-repos/wikibugs2] - 
https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51 [19:12:10] (merge) reedy: channels.yaml: Widen AutoWikiBrowser regex [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/51 [20:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:32:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:33:22] (approved) sstefanova: Create single-node clusters by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/223 (https://phabricator.wikimedia.org/T385082) (owner: fnegri) [20:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:42:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:17:44] Wikibugs: CI seems to be broken - https://phabricator.wikimedia.org/T385302#10513669 (bd808) Open→Resolved a:hashar [21:25:11] FIRING: Temperature: Inlet Temp issue on 
clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:40:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [22:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:30:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [22:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:40:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [23:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks