[00:02:09] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T371944#10047096 (LibUp-bot)
[00:02:11] Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T370115#10047098 (LibUp-bot) A new upstream version of Pywikibot is now available: 9.3.1. * https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Pywikibot_image * https://gerrit.wikimedia.org/g/pywikibot/core/+/refs/tags/...
[00:16:29] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:29] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:18:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T371878)
[01:18:35] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[01:20:57] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10047217 (Andrew)
[02:13:24] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[02:19:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[02:35:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 7 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[02:35:16] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T370752#10047266 (phaultfinder)
[02:45:31] FIRING: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[02:50:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 15 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[02:50:31] RESOLVED: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[02:53:56] FIRING: SystemdUnitDown: The service unit clean_puppet_client_bucket.service is in failed status on host cloudcephosd1038. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1038 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:05:14] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[03:08:55] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)
[03:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[03:09:01] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[03:10:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[03:20:53] (update) raymond-ndibe: [builds-cli] remove _display_messages [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/69
[03:41:28] (open) raymond-ndibe: [jobs-cli] remove _display_messages [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/62
[03:41:36] (update) raymond-ndibe: [jobs-cli] remove _display_messages [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/62
[04:18:56] FIRING: [2x] SystemdUnitDown: The systemd unit clean_puppet_client_bucket.service on node cloudcephosd1035 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:19:07] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T370383#10047286 (phaultfinder)
[04:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:23:55] (open) raymond-ndibe: [envvars-cli] remove display_messages [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/57
[04:26:31] (update) raymond-ndibe: [envvars-cli] remove display_messages [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/57
[04:27:12] (update) raymond-ndibe: [envvars-cli] remove display_messages [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/57
[04:30:33] Toolforge (Toolforge iteration 14): [jobs-api] move jobs load feature to the backend - https://phabricator.wikimedia.org/T366209#10047294 (Raymond_Ndibe) In progress→Resolved
[04:30:38] Toolforge (Toolforge iteration 14): [jobs-cli] enforce proper validation for load jobs before calculate_changes - https://phabricator.wikimedia.org/T366211#10047291 (Raymond_Ndibe) In progress→Resolved
[04:31:59] Toolforge (Toolforge iteration 14): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#10047299 (Raymond_Ndibe) In progress→Resolved
[04:32:09] Toolforge (Toolforge iteration 14): [toolforge-weld] support back python 3.7 - https://phabricator.wikimedia.org/T370932#10047297 (Raymond_Ndibe) In progress→Resolved
[04:37:45] Toolforge: toolforge jobs load flushes out all jobs - https://phabricator.wikimedia.org/T364204#10047300 (Raymond_Ndibe) @Multichill this issue has been fixed. closing now. you can re-open if you notice something similar again
[04:37:53] Toolforge: toolforge jobs load flushes out all jobs - https://phabricator.wikimedia.org/T364204#10047301 (Raymond_Ndibe) Open→Resolved
[04:45:01] Toolforge (Toolforge iteration 14), Patch-For-Review: [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#10047302 (Raymond_Ndibe)
[04:48:56] FIRING: [3x] SystemdUnitDown: The systemd unit clean_puppet_client_bucket.service on node cloudcephosd1035 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:49:08] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T370383#10047306 (phaultfinder)
[04:59:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:06:22] Toolforge (Toolforge iteration 14): Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621#10047309 (Raymond_Ndibe)
[05:06:25] cloud-services-team, Toolforge: toolforge: integrate fourohfour as a custom component, rather than a normal tool - https://phabricator.wikimedia.org/T369364#10047311 (Raymond_Ndibe)
[05:12:59] Toolforge (Toolforge iteration 14): [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066#10047316 (Raymond_Ndibe)
[05:18:42] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668#10047317 (Raymond_Ndibe)
[05:18:49] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#10047318 (Raymond_Ndibe)
[05:20:24] Toolforge (Toolforge iteration 14): [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#10047319 (Raymond_Ndibe) a:Raymond_Ndibe
[05:21:35] Toolforge (Toolforge iteration 14): [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#10047321 (Raymond_Ndibe)
[05:25:06] Toolforge: [toolforge,jobs] "toolforge jobs logs" fails when job has not started yet - https://phabricator.wikimedia.org/T349775#10047324 (Raymond_Ndibe) a:Raymond_Ndibe
[05:25:27] Toolforge: [maintain-harbor] Move to become a toolforge component - https://phabricator.wikimedia.org/T358225#10047322 (Raymond_Ndibe) a:Raymond_Ndibe
[05:29:38] Toolforge: [jobs-api] when running a command with wrong quoting, no logs nor useful feedback is given to the user - https://phabricator.wikimedia.org/T356267#10047325 (Raymond_Ndibe)
[05:31:22] Toolforge: [jobs-cli] Add a new output format for toolforge jobs list command which returns the input command for scheduled jobs - https://phabricator.wikimedia.org/T356581#10047326 (Raymond_Ndibe)
[05:32:07] Toolforge: [jobs-api] Periodically refresh image-config data - https://phabricator.wikimedia.org/T357112#10047327 (Raymond_Ndibe)
[05:33:43] Toolforge: Expose Toolforge service names via environment variables - https://phabricator.wikimedia.org/T151002#10047328 (Raymond_Ndibe)
[06:39:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)
[06:39:25] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[07:03:56] cloud-services-team, Beta-Cluster-Infrastructure, Bitu, CAS-SSO, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047378 (SLyngshede-WMF) ` 2024-08-06 20:10:04,149 WARN [org.apereo.cas.util.function.FunctionUtils] -
cloud-services-team, Beta-Cluster-Infrastructure, Bitu, CAS-SSO, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047381 (SLyngshede-WMF) CAS uses the following to lookup the user: ` cas.authn.ldap[0].basedn=dc=wikimedia,dc=org cas.a...
[07:15:50] cloud-services-team, Beta-Cluster-Infrastructure, Bitu, CAS-SSO, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047401 (SLyngshede-WMF) While I don't have the password, I've tested authenticating as jenkin-deploy on idp-test2004, an...
[07:16:01] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:29] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[07:18:56] RESOLVED: SystemdUnitDown: The service unit clean_puppet_client_bucket.service is in failed status on host cloudcephosd1035. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:23:56] FIRING: [2x] SystemdUnitDown: The service unit clean_puppet_client_bucket.service is in failed status on host cloudcephosd1035. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:38:56] RESOLVED: SystemdUnitDown: The service unit ifup@eno12409np1.service is in failed status on host cloudcephosd1035. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:40:01] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:47] FIRING: NodeDown: The node cloudcephosd1035 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1035 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[07:52:35] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[07:53:47] RESOLVED: NodeDown: The node cloudcephosd1035 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1035 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:11:10] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T371878)
[08:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:11:16] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[08:18:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[08:26:46] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T371878)
[08:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:26:52] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[08:27:17] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.wait_for_rebalance
[08:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:29:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 7 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:30:31] (CR) David Caro: [C:+2] WMCSCookbookRunnerBase: load the wmcs config if it's there [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059920 (owner: David Caro)
[08:33:29] FIRING: InstanceDown: Project gitlab-runners instance runner-1025 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:33:50] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:34:13] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.015 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[08:37:28] FIRING: InstanceDown: Project cloudinfra instance cloud-cumin-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:40:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of -1 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[08:40:33] Cloud-Services: PetScan not responding - https://phabricator.wikimedia.org/T371955 (Magnus) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task....
[08:41:20] Cloud-Services: PetScan not responding - https://phabricator.wikimedia.org/T371955#10047546 (Magnus) p:Triage→Unbreak!
[08:43:29] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1025 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:44:05] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10047553 (dcaro) cloudcephosd1035 has one drive that wrongly assigned as 'os raid': ` sdb...
[08:44:39] (Merged) jenkins-bot: WMCSCookbookRunnerBase: load the wmcs config if it's there [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059920 (owner: David Caro)
[08:45:17] cloud-services-team, Beta-Cluster-Infrastructure, Bitu, CAS-SSO, and 2 others: Update basedn in CAS - https://phabricator.wikimedia.org/T371930#10047550 (SLyngshede-WMF) a:SLyngshede-WMF
[08:51:07] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 21.072 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[08:51:11] cloud-services-team, Beta-Cluster-Infrastructure, Bitu, CAS-SSO, and 2 others: Update basedn in CAS - https://phabricator.wikimedia.org/T371930#10047573 (SLyngshede-WMF) p:Triage→Medium We've tested modifying the basedn on test and @hashar confirms that login is now working.
[08:51:55] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:53:16] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:55:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 14 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:56:46] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-19, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-11, tools-k8s-worker-nfs-41, tools-k8s-worker-nfs-13, tools-k8s-worker-nfs-50, tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-23, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-22
[09:00:37] Cloud-Services: PetScan not responding - https://phabricator.wikimedia.org/T371955#10047587 (Magnus) Open→Resolved a:Magnus Works again
[09:00:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[09:00:58] RESOLVED: InstanceDown: Project cloudinfra instance cloud-cumin-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:03:05] (PS1) David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395
[09:05:20] RESOLVED: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:06:26] (CR) CI reject: [V:-1] bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395 (owner: David Caro)
[09:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:29:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:31:42] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-19, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-11, tools-k8s-worker-nfs-41, tools-k8s-worker-nfs-13, tools-k8s-worker-nfs-50, tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-23, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-22
[09:34:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-19 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:36:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:36:47] Tools: PetScan not responding - https://phabricator.wikimedia.org/T371955#10047630 (Aklapper)
[09:39:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-19 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:49:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-11 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:59:03] RESOLVED: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-11 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:10:42] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-sgebastion-10
[10:11:00] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-sgebastion-10
[10:11:42] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:12:02] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[10:12:50] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-7, tools-k8s-worker-nfs-9
[10:21:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:24:20] toolforge_i18n, Tools, I18n, Wikimania-Hackathon-2024: Extract Python library for Wikimedia tool i18n from Wikidata Lexeme Forms tool - https://phabricator.wikimedia.org/T283376#10047729 (LucasWerkmeister) The library is making good progress; the biggest TODO left is better documentation, which I...
[10:30:27] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-7, tools-k8s-worker-nfs-9
[11:19:49] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.wait_for_rebalance (exit_code=0)
[12:06:29] (CR) FNegri: [C:+1] "LGTM!" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059925 (owner: David Caro)
[13:08:23] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1037.eqi...
[13:16:05] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T371944#10048149 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/444
[13:16:13] vivian-rook opened https://github.com/toolforge/paws/pull/444
[13:17:24] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-42, tools-k8s-worker-nfs-8, tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-38, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-55, tools-k8s-worker-nfs-43
[13:17:25] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs.
[13:27:28] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Data-Platform-SRE (2024.07.29 - 2024.08.16), Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depre... - https://phabricator.wikimedia.org/T370465#10048205
[13:34:06] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Data-Platform-SRE (2024.07.29 - 2024.08.16), Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depr... - https://phabricator.wikimedia.org/T370465#10048248
[13:34:12] Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm - https://phabricator.wikimedia.org/T327742#10048252 (BTullis)
[13:35:47] (PS2) David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395
[13:36:45] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure: Replace deployment-eventlog08 with Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369918#10048254 (BTullis) @Ottomata - Do you think we could delete this host, rather than replace it? Or do we sti...
[13:47:13] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048325 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1037.eqiad.w...
[13:48:44] (PS1) David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443
[13:51:36] (CR) CI reject: [V:-1] bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395 (owner: David Caro)
[13:55:10] (CR) David Caro: "recheck" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395 (owner: David Caro)
[14:02:00] cloud-services-team, DNS: Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621#10048375 (ayounsi)
[14:04:19] (PS1) David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451
[14:04:35] (CR) CI reject: [V:-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443 (owner: David Caro)
[14:05:15] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-42, tools-k8s-worker-nfs-8, tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-38, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-55, tools-k8s-worker-nfs-43
[14:05:16] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs.
[14:07:39] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T363344)
[14:07:40] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs.
[14:07:41] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[14:13:30] (PS3) David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395
[14:13:30] (PS2) David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443
[14:13:30] (PS2) David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451
[14:13:30] (PS1) David Caro: tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060453
[14:13:54] (CR) CI reject: [V:-1] bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395 (owner: David Caro)
[14:13:56] (CR) CI reject: [V:-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443 (owner: David Caro)
[14:14:00] (CR) CI reject: [V:-1] ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451 (owner: David Caro)
[14:14:01] (CR) CI reject: [V:-1] tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060453 (owner: David Caro)
[14:14:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:15:36] (PS2) David Caro: tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060453
[14:15:36] (PS4) David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395
[14:15:36] (PS3) David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443
[14:15:36] (PS3) David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451
[14:17:41] (CR) CI reject: [V:-1] ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451 (owner: David Caro)
[14:18:19] (PS3) David Caro: tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410)
[14:18:21] (PS5) David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060395
[14:18:21] (PS4) David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060443
[14:18:21] (PS4) David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1060451
[14:18:30] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T363344)
[14:18:32] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs.
[14:18:33] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344 [14:18:53] (03CR) 10Andrew Bogott: [C:03+1] "lgtm" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410) (owner: 10David Caro) [14:19:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:27:23] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure, 10Data-Platform-SRE (2024.07.29 - 2024.08.16), 13Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depre... - https://phabricator.wikimedia.org/T370465#10048432 [14:28:12] (03CR) 10CI reject: [V:04-1] ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 (owner: 10David Caro) [14:28:49] (03CR) 10CI reject: [V:04-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 (owner: 10David Caro) [14:29:02] (03CR) 10CI reject: [V:04-1] tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410) (owner: 10David Caro) [14:29:29] (03CR) 10CI reject: [V:04-1] bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060395 (owner: 10David Caro) [14:29:49] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure, 10Data-Platform-SRE (2024.07.29 - 2024.08.16), 13Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depre... 
- https://phabricator.wikimedia.org/T370465#10048438 [14:36:01] 06Toolforge-standards-committee: Adoption request for vrb (VimeoReviewBot) - https://phabricator.wikimedia.org/T338556#10048449 (10Novem_Linguae) [14:37:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:42:43] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Replace deployment-eventlog08 with Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369918#10048472 (10Ottomata) WE ARE SO CLOSE TO DELETING. I had hoped to be done already but keep encountering anno... [14:43:05] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Replace deployment-eventlog08 with Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369918#10048473 (10Ottomata) Oh! wait this is in beta! Yes, we can delete this. There should be no use for this in... [14:45:40] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure, 10Data-Platform-SRE (2024.07.29 - 2024.08.16), 13Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depre... - https://phabricator.wikimedia.org/T370465#10048492 [14:48:19] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Replace deployment-eventlog08 with Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369918#10048500 (10BTullis) a:03BTullis >>! In T369918#10048473, @Ottomata wrote: > Oh! wait this is in beta! > >... 
[14:51:02] 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm - https://phabricator.wikimedia.org/T327742#10048518 (10BTullis) [14:52:29] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Replace deployment-eventlog08 with Bullseye or Bookworm host - https://phabricator.wikimedia.org/T369918#10048510 (10BTullis) 05Open→03Resolved Oh it looks like it was already shut down. Not sure when that... [15:00:33] (03PS4) 10David Caro: tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410) [15:00:33] (03PS6) 10David Caro: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060395 [15:00:33] (03PS5) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [15:00:33] (03PS5) 10David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 [15:02:11] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [15:02:13] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs. [15:02:25] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [15:02:25] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs. [15:03:59] (03CR) 10CI reject: [V:04-1] ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 (owner: 10David Caro) [15:04:17] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [15:04:17] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs. 
[15:06:44] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [15:06:45] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs. [15:11:10] 10Cloud-VPS: PetScan not responding - https://phabricator.wikimedia.org/T371955#10048592 (10JJMC89) [15:12:26] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.wait_for_rebalance [15:12:27] wmbot~dcaro@urcuchillay: Failed to log message to wiki. Somebody should check the error logs. [15:13:52] (03CR) 10David Caro: [C:03+2] tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410) (owner: 10David Caro) [15:14:55] (03PS6) 10David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 [15:14:55] (03PS6) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [15:15:31] (03CR) 10David Caro: [C:03+2] bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060395 (owner: 10David Caro) [15:18:01] (03Merged) 10jenkins-bot: tox: skip py312 as spicerack does not support it yet [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060453 (https://phabricator.wikimedia.org/T354410) (owner: 10David Caro) [15:18:11] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure, 10Data-Platform-SRE (2024.07.29 - 2024.08.16), 13Patch-For-Review: Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster depr... 
- https://phabricator.wikimedia.org/T370465#10048607 [15:19:05] (03CR) 10CI reject: [V:04-1] ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 (owner: 10David Caro) [15:19:14] (03CR) 10CI reject: [V:04-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 (owner: 10David Caro) [15:19:57] (03Merged) 10jenkins-bot: bootstrap_and_add: ask only once for device destroy ack [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060395 (owner: 10David Caro) [15:21:30] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... [15:23:40] (03PS7) 10David Caro: ceph.bootstrap_and_add: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060451 [15:23:41] (03PS7) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [15:30:01] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:30:03] dcaro@cloudcumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:36:24] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... 
[15:37:19] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... [15:39:36] !log dcaro@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=97) [15:39:37] dcaro@cloudcumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:41:06] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [15:41:15] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [15:55:49] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 4 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10048714 (10bd808) [16:01:21] (03PS8) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [16:02:07] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [16:02:12] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:05:37] (03CR) 10CI reject: [V:04-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 (owner: 10David Caro) [16:15:48] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... 
[16:23:17] (03PS9) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [16:23:17] (03PS1) 10David Caro: ceph.drain*: use --osd-hostname and --cluster-name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060469 [16:23:17] (03PS1) 10David Caro: undrain_node: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060470 [16:24:57] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [16:25:02] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:27:43] (03CR) 10CI reject: [V:04-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 (owner: 10David Caro) [16:27:43] (03CR) 10CI reject: [V:04-1] undrain_node: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060470 (owner: 10David Caro) [16:27:59] (03CR) 10CI reject: [V:04-1] ceph.drain*: use --osd-hostname and --cluster-name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060469 (owner: 10David Caro) [16:28:27] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:28:46] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:29:24] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 4 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10048887 (10brennen) > Change #1060468 had a related patch set up... 
[16:30:43] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:31:02] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:31:54] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:32:18] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:33:44] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:34:01] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:34:25] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:35:33] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:36:39] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:38:12] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:38:15] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:38:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:38:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:39:45] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:39:48] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:40:46] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:43:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:45:12] 10PAWS: Non-notebook files don't redirect to paws-public when URL is changed - https://phabricator.wikimedia.org/T143459#10048969 (10Pppery) [16:45:19] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node [16:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:46:17] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) [16:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:49:27] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:50:17] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [16:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:50:24] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node [16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:51:09] (03PS10) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [16:51:09] (03PS2) 10David Caro: ceph.drain*: use --osd-hostname and --cluster-name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060469 [16:51:09] (03PS2) 10David Caro: undrain_node: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060470 [16:51:09] (03PS1) 10David Caro: ceph.drain/undrain_node: allow filtering by osd-id [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060473 [16:53:56] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services, 06DBA: Prepare and check storage 
layer for btmwiki - https://phabricator.wikimedia.org/T368066#10048980 (10BTullis) We are still experiencing a failure relating to `btmwiki` at the beginning of each month. It is something to do with the grants o... [16:55:02] (03CR) 10CI reject: [V:04-1] ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 (owner: 10David Caro) [16:55:13] (03CR) 10CI reject: [V:04-1] ceph.drain/undrain_node: allow filtering by osd-id [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060473 (owner: 10David Caro) [16:55:14] (03CR) 10CI reject: [V:04-1] undrain_node: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060470 (owner: 10David Caro) [16:55:23] (03CR) 10CI reject: [V:04-1] ceph.drain*: use --osd-hostname and --cluster-name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060469 (owner: 10David Caro) [16:56:13] (03PS11) 10David Caro: ceph.undrain: use the size of the drive in TiB as weight [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060443 [16:56:13] (03PS3) 10David Caro: ceph.drain*: use --osd-hostname and --cluster-name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060469 [16:56:13] (03PS3) 10David Caro: undrain_node: wait by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060470 [16:56:13] (03PS2) 10David Caro: ceph.drain/undrain_*: allow filtering by osd-id [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060473 [16:56:33] !log dcaro@urcuchillay admin END (ERROR) - Cookbook wmcs.ceph.wait_for_rebalance (exit_code=97) [16:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:00:25] 10VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992 (10bd808) 03NEW [17:02:17] (03PS2) 10David Caro: ceph.{drain,undrain}: fix chunking [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060173 [17:02:19] 
!log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) [17:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:02:26] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [17:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:03:08] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:03:16] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [17:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:03:32] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [17:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:04:45] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [17:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:05:37] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [17:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:05:41] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:06:29] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [17:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:06:40] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node [17:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:07:11] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [17:07:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:07:19] (03PS3) 10David Caro: ceph.{drain,undrain}: fix chunking [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1060173 [17:07:20] 10VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993 (10bd808) 03NEW [17:10:44] (03CR) 10David Caro: [C:03+2] openstack.tofu: use gitlab token from wmcs config [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1059925 (owner: 10David Caro) [17:11:55] 10VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993#10049054 (10bd808) Determining what to index for this feels like an open question. There are lists of clients at https://www.mediawiki.org/wiki/API:Client_code and https://www.mediawiki.org/wiki/A... [17:14:39] (03Merged) 10jenkins-bot: openstack.tofu: use gitlab token from wmcs config [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1059925 (owner: 10David Caro) [17:19:33] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.wait_for_rebalance [17:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:53:03] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.wait_for_rebalance (exit_code=0) [17:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:54:34] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... 
[17:56:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [17:57:08] 10VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992#10049153 (10Bugreporter) Per {T268196} I think we should index all primary (non-fork) GitLab repos instead, since GitLab CE does not have any global search feature. [18:01:54] 10VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992#10049161 (10Dzahn) This looks like an example change where a repo was added to codesearch in the past: https://gerrit.wikimedia.org/r/c/labs/codesearch/+/414060 [18:03:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:13:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:17:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [18:17:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [18:19:30] 10Tool-Pageviews: pageviews tool doesn't work in several newer wikis - https://phabricator.wikimedia.org/T371997 (10Amire80) 03NEW [18:19:56] FIRING: CloudVPSDesignateLeaks: Detected 5 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - 
https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:24:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:32:16] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049245 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... [18:33:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [18:34:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:43:22] 10VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992#10049412 (10bd808) I found that https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/fae2553e35f901c6f678cb5b696681a35df1cd50/write_config.py#469 is configuring codesearch t... 
[19:54:05] (03PS1) 10BryanDavis: config: Index https://gitlab.wikimedia.org/toolforge-repos/* [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1060493 (https://phabricator.wikimedia.org/T371992) [19:56:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [19:59:06] PROBLEM - Host cloudcephosd1037 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:38] ACKNOWLEDGEMENT - SSH on cloudcephosd1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Andrew Bogott rebooting from a non-icinga-enabled cookbook https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:01:38] ACKNOWLEDGEMENT - Host cloudcephosd1037 is DOWN: PING CRITICAL - Packet loss = 100% Andrew Bogott rebooting from a non-icinga-enabled cookbook [20:03:02] RECOVERY - Host cloudcephosd1037 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:03:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [20:06:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) [20:07:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [20:07:57] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 4 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049480 (10brennen) [20:08:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - 
https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [20:08:32] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 4 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049505 (10brennen) Removing as train blocker for .17, leaving o... [20:10:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [20:11:08] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=97) [20:11:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [20:15:05] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049550 (10bd808) >>! In T371977#10048706, @bd808 wrote: > the... [20:17:24] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049556 (10LucasWerkmeister) >>! In T371977#10048887, @brennen w... 
[20:18:10] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049558 (10LucasWerkmeister) p:05Unbreak!→03Triage [20:29:41] RESOLVED: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:00:12] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049711 (10LucasWerkmeister) >>! In T371977#10049550, @bd808 wro... [21:05:18] 10Cloud-VPS, 10Striker, 10Tool-gitlab-account-approval, 10Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049722 (10bd808) >>! In T371977#10049711, @LucasWerkmeister wro... [21:05:56] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: [ceph,network] Intermittent network packets lost - https://phabricator.wikimedia.org/T371869#10049717 (10Dzahn) Also see: T371879#10049699 Something created a large traffic spike between cloudsw1-d5 and cloudsw1-f4 today. 
[21:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:59:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:39:32] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 6 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10049882 (Krinkle) "I told you so". I specifically amended ht...
[23:23:22] (open) raymond-ndibe: Draft: [maintain-kubeusers] increment default quota for pods, cpu, mem [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/58 (https://phabricator.wikimedia.org/T341066)
[23:24:18] (open) raymond-ndibe: Draft: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066)
[23:24:47] (update) raymond-ndibe: Draft: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066)
[23:25:09] (update) raymond-ndibe: Draft: [maintain-kubeusers] increment default quota for pods, cpu, mem [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/58 (https://phabricator.wikimedia.org/T341066)