[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:51:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:52:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[00:56:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:57:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:00:28] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 86.88%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[02:37:53] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712#11018926 (Andrew) I'm looking for a few examples of this on logstash. On July 4, The keystone hooks for the project 'wikidata-deleted' took 2:09 with the majority of the...
[03:17:00] Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11018955 (Novem_Linguae) >>! In T399418#11002517, @Snaevar wrote: > -1, it is forbidden to put private sql tables to Wikimedia Cloud. This task does not have a reason to move it. I also find t...
[03:48:47] Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11018962 (Soda) I agree with @Novem_Linguae that the data being consumed is public data, and as such, there is no privacy issue. -- If the concern is transparency, the tables can be made publi...
[04:32:19] Data-Services, Data-Engineering, Data-Engineering-Radar, DBA, Privacy Engineering: Create views for SecurePoll db tables on Wiki Replicas - https://phabricator.wikimedia.org/T381197#11019054 (Novem_Linguae) I agree with SD0001's choices for public tables and private tables in the original pos...
[05:04:20] cloud-services-team, Toolforge: Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024 (DamianZaremba) NEW
[05:14:04] cloud-services-team, Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025 (DamianZaremba) NEW
[07:29:54] (CR) Birusha: [C:+1] fix: add search on different selects (continent, country, date and platform) [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1166002 (https://phabricator.wikimedia.org/T390440) (owner: Bovimacoco)
[07:30:36] (CR) Birusha: [C:+1] fix: fixing language switching in LanguageSelector [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1166013 (https://phabricator.wikimedia.org/T390664) (owner: Bovimacoco)
[08:06:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[08:07:28] FIRING: TargetDown: Job jupyterhub is unreachable 
in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[08:10:32] wikitech.wikimedia.org, Wikidata, Wikimedia-Interwiki-links, Patch-For-Review, Wikidata Integration in Wikimedia projects (Kanban Board): Enable interwiki links to/from Wikitech - https://phabricator.wikimedia.org/T290147#11019413 (Arendpieter) what is the status of this patch?
[09:01:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[09:02:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[09:45:20] Tool-extjsonuploader: extjsonuploader seems to have stopped updating ExtensionJson - https://phabricator.wikimedia.org/T400044 (A_smart_kitten) NEW
[10:18:52] !log dcaro@hephaestus pawsdev START - Cookbook wmcs.vps.add_user_to_project for user 'dcaro' in role 'member'
[10:18:54] wmbot~dcaro@hephaestus: Unknown project "pawsdev"
[10:18:57] !log dcaro@hephaestus pawsdev END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'dcaro' in role 'member'
[10:18:58] wmbot~dcaro@hephaestus: Unknown project "pawsdev"
[11:32:08] dhinus opened https://github.com/toolforge/paws/pull/495
[11:33:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[11:43:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO 
issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:24:03] (update) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/42
[13:50:33] (update) raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[13:53:12] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (5d 23h 29m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[13:57:59] (open) damian: [T400024] Allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/183
[13:58:06] (open) damian: [T400024] Allow protocol to be specified for ports [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[13:58:15] cloud-services-team, Toolforge: Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024#11020377 (DamianZaremba) Gitlab access is working now, so had a dig into this a bit more. I need to test this against k8s, but something similar to this is probably all that is needed: api: https://...
[13:59:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399858)
[13:59:15] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[13:59:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.02%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[14:00:00] (open) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/184
[14:00:18] (open) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114
[14:00:21] cloud-services-team, Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11020383 (DamianZaremba) Digging into this a bit more, there is actually support for this today (https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/runtimes/k8s/healthcheck...
[14:01:37] (PS1) Stevemunene: Add keytabs for new an-druid100[67] hosts [labs/private] - https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440)
[14:05:24] (update) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114
[14:05:56] (update) raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[14:05:58] (update) raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[14:06:00] (approved) raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[14:06:06] (merge) raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[14:08:43] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880
[14:09:28] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.02%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[14:09:51] (update) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114
[14:11:14] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts 
[repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[14:14:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[14:15:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[14:15:34] (update) dcaro: [T400024] Allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/183 (owner: damian)
[14:28:39] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#11020482 (dcaro)
[14:30:28] Toolforge (Toolforge iteration 22): [components-api] store the config used for the deployment in the deployment themselves - https://phabricator.wikimedia.org/T400064 (dcaro) NEW
[14:55:51] (update) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/184
[14:58:41] (update) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/184
[15:04:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399858)
[15:04:10] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[15:04:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[15:05:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[15:07:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399858)
[15:07:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399858)
[15:15:56] FIRING: [8x] SystemdUnitDown: The service unit prometheus-node-pinger.service is in failed status on host cloudcephosd1010. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:20:56] FIRING: [35x] SystemdUnitDown: The service unit prometheus-node-pinger.service is in failed status on host cloudcephosd1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:49:26] RESOLVED: SystemdUnitDown: The service unit prometheus-node-pinger.service is in failed status on host cloudcephosd1033. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1033 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:49:49] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[15:51:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)
[15:51:06] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T399858)
[15:51:11] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[15:53:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)
[15:53:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T399858)
[15:56:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)
[15:56:41] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[15:58:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:00:22] PROBLEM - Host cloudcephosd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.63%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[16:01:10] RECOVERY - Host cloudcephosd1006 is UP: PING OK - Packet loss = 0%, RTA 
= 0.38 ms
[16:02:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1005.eqiad.wmnet' (T395255)
[16:02:10] T395255: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255
[16:09:24] PROBLEM - Host cloudnet1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:11:18] RECOVERY - Host cloudnet1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:11:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1005.eqiad.wmnet' (T395255)
[16:11:33] T395255: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255
[16:36:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[16:36:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[16:38:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[16:43:50] RESOLVED: [4x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[16:44:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[16:44:58] (open) dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[16:47:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[16:47:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[17:01:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[17:02:11] !log andrew@cloudcumin1001 admin 
END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[17:03:20] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11021403 (fnegri) > Let's schedule a time, I don't trust the server to reconnect automatically - though I'm positive it probably will. @YochayCO le...
[17:03:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[17:04:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[17:05:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[17:05:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[17:09:36] (update) dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[17:43:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance project-proxy-acme-chief-03 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:52:46] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75
[17:58:11] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75
[18:13:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-75 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:43:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance project-proxy-acme-chief-03 
in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:03:12] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (5d 18h 19m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[19:04:10] FIRING: ProjectProxyMainProxyInstanceDown: Proxy on proxy-6 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown
[19:06:10] FIRING: ProjectProxyMainProxyDown: Proxy service address is unreachable - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[19:08:12] RESOLVED: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (5d 18h 19m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[19:08:28] FIRING: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:08:28] FIRING: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[19:08:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[19:08:39] FIRING: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown
[19:08:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[19:08:59] FIRING: [2x] MetricsinfraAlertmanagerDown: Metricsinfra 
alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[19:09:08] cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T400097 (phaultfinder) NEW
[19:09:10] FIRING: [2x] ProjectProxyMainProxyInstanceDown: Proxy on proxy-5 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown
[19:12:33] cloud-services-team, Striker, CAS-SSO: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#11021861 (Arendpieter)
[19:13:28] RESOLVED: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:13:57] FIRING: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[19:14:07] FIRING: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[19:16:10] RESOLVED: ProjectProxyMainProxyDown: Proxy service address is unreachable - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[19:18:28] RESOLVED: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[19:18:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[19:18:39] 
RESOLVED: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown
[19:18:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[19:18:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:18:56] RESOLVED: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[19:18:58] RESOLVED: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[19:19:07] RESOLVED: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[19:19:10] RESOLVED: [2x] ProjectProxyMainProxyInstanceDown: Proxy on proxy-5 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown
[19:23:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:33:20] PROBLEM - SSH on cloudcephosd1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:35:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 31 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[19:35:16] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 31 slow ops - https://phabricator.wikimedia.org/T400104 (phaultfinder) NEW
[19:35:47] FIRING: NodeDown: Node cloudcephosd1006 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1006 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[19:37:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[19:38:10] RECOVERY - SSH on cloudcephosd1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:39:41] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11021997 (Andrew) cloudcephosd1006 now has a full rebuild of all OSDs and is running pacific and bookworm...
[19:40:47] RESOLVED: NodeDown: Node cloudcephosd1006 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1006 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[19:44:22] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T399858)
[19:44:31] T399858: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858
[19:45:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 131 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[19:45:10] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (5d 17h 37m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[19:53:33] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11022038 (VRiley-WMF)
[20:02:10] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:02:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[20:03:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[20:07:16] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[20:07:18] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook 
wmcs.ceph.osd.undrain_node (exit_code=99)
[20:07:42] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[20:07:44] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99)
[20:08:00] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[20:19:59] !log dcaro@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97)
[20:20:12] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[20:30:28] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 86.02%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[20:54:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[20:55:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[21:08:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks