[00:04:31] Toolforge-standards-committee: Adoption request for geograph2commons - https://phabricator.wikimedia.org/T345707#10053311 (bd808) >>! In T345707#10053246, @bjh21 wrote: > I made this request in response to [[ https://commons.wikimedia.org/wiki/Commons:Help_desk/Archive/2023/08#Transfer_from_Geograph | a thre...
[00:16:29] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:29] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:46:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[00:47:04] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[00:47:10] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)
[00:47:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[00:47:42] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10053322 (Andrew)
[01:51:18] cloud-services-team, DC-Ops, ops-codfw, SRE: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053337 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut...
[03:16:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:25:51] (update) samwilson: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T363240 https://phabricator.wikimedia.org/T364648)
[04:28:09] (update) samwilson: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T364648)
[04:50:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:06:56] FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:29:03] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10053420 (Andrew)
[05:35:08] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)
[05:35:14] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[05:36:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[06:59:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[07:01:56] FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:02:06] cloud-services-team: SystemdUnitDown Unit backup_vms.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T372126 (phaultfinder) NEW
[07:23:31] Toolforge: Java application redeploys several times until it starts - https://phabricator.wikimedia.org/T372092#10053514 (Benjavalero) Today I have seen that along the day there was another restart at 14.41 UTC, this time with a (maybe useful) trace: ` 2024-08-08 14:05:37,826 DEBUG [uler-2] e.b.r.f.l.load.Li...
[08:19:10] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[08:21:56] Cloud-VPS (Quota-requests): [Quota increase]: globaleducation - https://phabricator.wikimedia.org/T372134 (Ragesoss) NEW
[08:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[08:32:22] (CR) Thiemo Kreuz (WMDE): [C:-1] Fix the typo error from one to on (1 comment) [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1059930 (owner: GauriGuptaa)
[09:06:56] FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:41:56] FIRING: [2x] SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:01:56] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:02:06] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T370383#10053862 (phaultfinder)
[11:27:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)
[11:27:20] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[12:25:24] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:36:56] RESOLVED: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:36:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:37:02] cloud-services-team: SystemdUnitDown Unit purge_vm_backup.service on node cloudbackup1004 has been down for long. - https://phabricator.wikimedia.org/T372143 (phaultfinder) NEW
[12:41:56] RESOLVED: SystemdUnitDown: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:46:49] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T371944#10054012 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/444
[12:46:54] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T371944#10054013 (rook) Open→Resolved a:rook
[12:47:01] vivian-rook closed https://github.com/toolforge/paws/pull/444
[12:55:35] VPS-project-Codesearch: mwclient should be indexed by codesearch - https://phabricator.wikimedia.org/T372144 (Tgr) NEW
[12:58:14] VPS-project-Codesearch: AWB should be indexed by codesearch - https://phabricator.wikimedia.org/T372145 (Tgr) NEW
[13:00:23] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10054051 (Tgr) >>! In T371977#10049882, @Krinkle wrote: > I spe...
[13:12:50] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10054081 (AdamWill) mwclient-side fix is merged and I intend to...
[13:14:25] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10054082 (Andrew)
[13:16:05] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10054089 (Tgr)
[13:16:16] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10054091 (Tgr)
[13:18:53] VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993#10054103 (Tgr)
[13:18:54] VPS-project-Codesearch: AWB should be indexed by codesearch - https://phabricator.wikimedia.org/T372145#10054101 (Tgr) →Duplicate dup:T371993
[13:19:48] VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993#10054098 (Tgr)
[13:20:31] VPS-project-Codesearch: mwclient should be indexed by codesearch - https://phabricator.wikimedia.org/T372144#10054096 (Tgr) →Duplicate dup:T371993
[13:22:06] VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993#10054106 (Tgr) The other major fallout was {T372017}. AWB is still using SVN so that sounds like a challenge.
[13:27:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:34:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[13:34:20] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[13:35:53] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)
[13:36:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[13:38:28] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)
[13:38:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[14:07:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:15:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[14:19:41] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:12:55] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [gabina] - https://phabricator.wikimedia.org/T372153 (Gabinaluz) NEW
[15:15:20] Cloud-VPS (Quota-requests): [Quota increase]: globaleducation - https://phabricator.wikimedia.org/T372134#10054411 (Slst2020) +1
[15:17:49] Cloud-VPS (Quota-requests): [Quota increase]: globaleducation - https://phabricator.wikimedia.org/T372134#10054413 (Slst2020) a:Slst2020
[15:22:18] !log sstefanova@cloudcumin1001 globaleducation START - Cookbook wmcs.openstack.quota_increase
[15:22:26] !log sstefanova@cloudcumin1001 globaleducation END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[15:25:56] Cloud-VPS (Quota-requests): [Quota increase]: globaleducation - https://phabricator.wikimedia.org/T372134#10054438 (Slst2020) Open→Resolved Done; please reopen the ticket when you no longer need the extra quota. :)
[15:29:41] RESOLVED: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:30:55] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10054599 (Maximilianklein) @andrew , confirmed. That is my plan this next week. To get this done. [ ] create cinder volume. [ ] move project code [ ] move mysql-db files [...
[17:00:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:26:27] Tool-wikiloves: WLE in the Democratic Republic of the Congo - https://phabricator.wikimedia.org/T372166 (CapitainAfrika) NEW
[18:24:01] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10054698 (DavidBrooks) To the comment on breaking-or-not //site...
[18:35:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:35:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:39:36] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)
[18:39:41] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[18:40:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:41:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:16:39] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)
[19:16:45] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[19:41:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:20:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[23:08:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses