[00:04:44] (update) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] (refactor_validate_kube_quant) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[00:07:29] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:08:29] Toolforge (Toolforge iteration 14), Patch-For-Review: [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#10071889 (Raymond_Ndibe) Open→In progress
[00:12:29] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:15:28] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:20:28] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:20:13] (CR) Legoktm: [C:+2] "Thanks!" [labs/codesearch] - https://gerrit.wikimedia.org/r/1060493 (https://phabricator.wikimedia.org/T371992) (owner: BryanDavis)
[03:21:15] (Merged) jenkins-bot: config: Index https://gitlab.wikimedia.org/toolforge-repos/* [labs/codesearch] - https://gerrit.wikimedia.org/r/1060493 (https://phabricator.wikimedia.org/T371992) (owner: BryanDavis)
[05:47:55] Cloud-VPS, Bitu, Infrastructure-Foundations: Can't activate my new key using the idm.wikimedia.org (bitu) interface - https://phabricator.wikimedia.org/T372581#10071977 (Meno25)
[06:44:44] Cloud-VPS, Bitu, Infrastructure-Foundations: Can't activate my new key using the idm.wikimedia.org (bitu) interface - https://phabricator.wikimedia.org/T372581#10071998 (SLyngshede-WMF) Thank you for a very good bug report, it's appreciated
[06:44:49] Cloud-VPS, Bitu, Infrastructure-Foundations: Can't activate my new key using the idm.wikimedia.org (bitu) interface - https://phabricator.wikimedia.org/T372581#10071994 (SLyngshede-WMF) Open→In progress p:Triage→High a:SLyngshede-WMF Hi, could you do a quick test for me? We had an...
[07:05:35] Cloud-VPS, Bitu, Infrastructure-Foundations: Can't activate my new key using the idm.wikimedia.org (bitu) interface - https://phabricator.wikimedia.org/T372581#10072001 (Meno25) >>! In T372581#10071994, @SLyngshede-WMF wrote: > Hi, could you do a quick test for me? We had another user with a similar...
[09:17:47] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453#10072423 (fnegri)
[09:18:34] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up - https://phabricator.wikimedia.org/T358774#10072425 (fnegri)
[09:18:35] cloud-services-team (FY2024/2025-Q1-Q2), Patch-For-Review: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334#10072427 (fnegri)
[09:19:54] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10072435 (fnegri)
[09:20:23] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14): Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#10072433 (fnegri)
[09:20:27] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#10072429 (fnegri)
[09:21:47] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240#10072437 (fnegri)
[09:22:06] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Patch-For-Review: Migrate Cloud VPS to Neutron Open vSwitch agent - https://phabricator.wikimedia.org/T326373#10072431 (fnegri)
[09:22:08] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158#10072439 (fnegri)
[09:22:19] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Goal: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#10072449 (fnegri)
[09:23:30] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Allow Quarry to query its own database - https://phabricator.wikimedia.org/T367415#10072441 (fnegri)
[09:23:45] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10072445 (fnegri)
[09:24:15] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338#10072447 (fnegri)
[09:25:09] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner - https://phabricator.wikimedia.org/T329709#10072455 (fnegri)
[09:25:13] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#10072460 (fnegri)
[09:25:17] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services: [wikireplicas] Views flaggedpage_pending and flaggedtemplates are broken - https://phabricator.wikimedia.org/T368939#10072461 (fnegri)
[09:25:21] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Platform-SRE: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427#10072462 (fnegri)
[09:25:27] cloud-services-team (FY2024/2025-Q1-Q2), Puppet-Infrastructure, Patch-For-Review: Ownership confusion on cloud-local puppet servers - https://phabricator.wikimedia.org/T364492#10072463 (fnegri)
[09:25:35] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Toolforge, Documentation: Restructure and improve content for: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database - https://phabricator.wikimedia.org/T232404#10072464 (fnegri)
[09:25:41] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge: [toolforge] webservice logs crashes with some unicode chars - https://phabricator.wikimedia.org/T364609#10072465 (fnegri)
[09:25:45] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge: [docs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313#10072466 (fnegri)
[09:25:50] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Use BGP to announce VM ranges from cloudnet to cloudgw - https://phabricator.wikimedia.org/T358868#10072467 (fnegri)
[09:25:52] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Toolforge, Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502#10072451 (fnegri)
[09:25:53] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nova-api,cloudrabbit] Connectivity issues from all cloudcontrols to all cloudrabbit nodes - https://phabricator.wikimedia.org/T356621#10072469 (fnegri)
[09:25:55] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098#10072468 (fnegri)
[09:25:57] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#10072470 (fnegri)
[09:26:01] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#10072471 (fnegri)
[09:26:05] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Deploy with Helm - https://phabricator.wikimedia.org/T356301#10072473 (fnegri)
[09:26:06] Cloud Services Proposals, cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, and 3 others: [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build... - https://phabricator.wikimedia.org/T194332#10072472
[09:26:10] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354294#10072475 (fnegri)
[09:26:14] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295#10072474 (fnegri)
[09:26:19] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud] puppet failing to run - https://phabricator.wikimedia.org/T353048#10072476 (fnegri)
[09:26:22] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [tf-infra-tests] Failing to destroy - volumes stuck - https://phabricator.wikimedia.org/T352895#10072477 (fnegri)
[09:26:27] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635#10072478 (fnegri)
[09:26:31] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [webservice shell] Allow a user to delete/stop all running shell pods - https://phabricator.wikimedia.org/T349733#10072479 (fnegri)
[09:26:35] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [ceph] Enable disk failure prediciton - https://phabricator.wikimedia.org/T349694#10072480 (fnegri)
[09:26:39] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [wmcs-cookbooks] add a cookbook to reboot a cloudservices/cloudlb host - https://phabricator.wikimedia.org/T348841#10072481 (fnegri)
[09:26:43] cloud-services-team (FY2024/2025-Q1-Q2), Infrastructure-Foundations: Remove wmcs-admin access from production cumin hosts - https://phabricator.wikimedia.org/T347979#10072483 (fnegri)
[09:26:47] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#10072482 (fnegri)
[09:26:51] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [wmcs-cookbooks] changes to openstack cli / auth things broke several cookbooks - https://phabricator.wikimedia.org/T346427#10072485 (fnegri)
[09:26:55] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Data-Services: cloudcumin: allow wmcs-admin to run wikireplicas cookbooks and scripts - https://phabricator.wikimedia.org/T347977#10072484 (fnegri)
[09:26:59] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Move Cloud VPS control plane alerting to alertmanager - https://phabricator.wikimedia.org/T345294#10072486 (fnegri)
[09:27:03] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Trove: tmpdir should be in external volume - https://phabricator.wikimedia.org/T336285#10072487 (fnegri)
[09:27:07] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services: [toolsdb] set gtid_domain_id to 0 - https://phabricator.wikimedia.org/T357341#10072488 (fnegri)
[09:27:11] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [wmcs-backup] Race condition between backup and cleanup timers - https://phabricator.wikimedia.org/T358780#10072489 (fnegri)
[09:27:15] cloud-services-team (FY2024/2025-Q1-Q2): Test using phabricator-maintenance-bot to sync wmcs-related boards - https://phabricator.wikimedia.org/T358251#10072490 (fnegri)
[09:27:19] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Documentation: Toolforge admin docs: revise new navigation menu and add category labels - https://phabricator.wikimedia.org/T345109#10072491 (fnegri)
[09:27:23] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Patch-For-Review: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885#10072492 (fnegri)
[09:27:27] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudvps] use a systemd timer for the OpenTofu tests to get logs - https://phabricator.wikimedia.org/T341769#10072494 (fnegri)
[09:27:31] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudvps] puppetize the OpenTofu tests VM (tf-infra-test) - https://phabricator.wikimedia.org/T341814#10072493 (fnegri)
[09:27:35] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [etcd,infra] Find a backup solution for the etcd database - https://phabricator.wikimedia.org/T339934#10072495 (fnegri)
[09:27:39] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [helmfile] Toolforge needs helmfile >=/0.145.3, but we have 0.135.0 - https://phabricator.wikimedia.org/T339328#10072496 (fnegri)
[09:27:43] cloud-services-team (FY2024/2025-Q1-Q2): Agree how to track/find all WMCS tasks that have a common topic, but belong to different projects - https://phabricator.wikimedia.org/T336681#10072497 (fnegri)
[09:27:47] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [horizon] Log in timing out due to nutcracker being stopped - https://phabricator.wikimedia.org/T333561#10072499 (fnegri)
[09:27:51] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network - https://phabricator.wikimedia.org/T329778#10072500 (fnegri)
[09:27:55] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: wmcs cookbooks: automate reset nova state of a VM - https://phabricator.wikimedia.org/T336678#10072498 (fnegri)
[09:27:59] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [maintain-dbusers] Generate prometheus metrics - https://phabricator.wikimedia.org/T332955#10072453 (fnegri)
[09:28:05] cloud-services-team (FY2024/2025-Q1-Q2), Quarry, superset.wmcloud.org: Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452#10072501 (fnegri)
[09:28:17] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10072457 (fnegri)
[10:46:40] Tool-schedule-deployment: schedule-deployment does not escape | characters in change subjects - https://phabricator.wikimedia.org/T372750 (Lucas_Werkmeister_WMDE) NEW
[10:50:41] Tool-schedule-deployment: schedule-deployment does not escape | characters in change subjects - https://phabricator.wikimedia.org/T372750#10072770 (Lucas_Werkmeister_WMDE)
[11:12:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:17:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:43:47] superset.wmcloud.org: remove superset-126-2 cluster - https://phabricator.wikimedia.org/T372752 (rook) NEW
[11:44:39] (CR) Jean-Frédéric: [C:+2] Updating toolforge login host [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1062402 (owner: Lokal Profil)
[11:44:58] superset.wmcloud.org: remove superset-126-2 cluster - https://phabricator.wikimedia.org/T372752#10072858 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/superset-deploy/pull/30
[11:45:08] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/30
[11:46:34] (Merged) jenkins-bot: Updating toolforge login host [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1062402 (owner: Lokal Profil)
[11:51:09] superset.wmcloud.org: remove superset-126-2 cluster - https://phabricator.wikimedia.org/T372752#10072864 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/superset-deploy/pull/30
[11:51:10] superset.wmcloud.org: remove superset-126-2 cluster - https://phabricator.wikimedia.org/T372752#10072865 (rook) Open→Resolved
[11:51:20] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/30
[12:33:14] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10073014 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm
[13:16:29] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10073148 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm completed: - cl...
[13:23:00] Grid-Engine-to-K8s-Migration, Wiki-Loves-Monuments-Database: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787#10073206 (JeanFred) Status report (see also https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL) * Recreated the ven...
[13:43:34] (PS7) Pwangai: Exempt Test Group Repositories [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565)
[13:44:53] (CR) Pwangai: Exempt Test Group Repositories (1 comment) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai)
[14:22:44] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10073463 (fnegri)
[14:26:58] Cloud-VPS, Striker, Tool-gitlab-account-approval, Tool-phab-ban, and 5 others: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10073485 (joanna_borun)
[14:31:32] Cloud-VPS, Bitu: Find or create .deb package for mwclient 0.11.0 (or mwclient 0.10.0 with writeapi dependency removed) - https://phabricator.wikimedia.org/T372345#10073541 (joanna_borun)
[15:13:35] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10073827 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/superset-deploy/pull/29
[15:13:40] Cloud-VPS (Debian Buster Deprecation), Machine-Learning-Team, Wikilabels: Cloud VPS "wikilabels" project Buster deprecation - https://phabricator.wikimedia.org/T367562#10073817 (Andrew) Open→Resolved a:Andrew > As the last custodians of the wikilabels infra, we/I can confirm that it >...
[15:13:46] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/29
[15:13:50] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10073828 (rook) Open→Resolved a:rook
[15:41:51] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10074000 (Andrew) Resolved→Open @cmooney says about cloudcephosd1036: > there is no sfp in port 21 on cloudsw1-d5-eqiad h...
[15:42:57] cloud-services-team, Cloud-VPS, Goal, Patch-For-Review: Replace use of openstack environment settings with clouds.yaml - https://phabricator.wikimedia.org/T337577#10073989 (Andrew) Resolved→Open
[15:43:53] cloud-services-team, Cloud-VPS, Goal, Patch-For-Review: Replace use of openstack environment settings with clouds.yaml - https://phabricator.wikimedia.org/T337577#10073997 (Andrew) Open→Resolved
[15:49:41] FIRING: CloudVPSDesignateLeaks: Detected 12 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:58:53] Data-Services: Optimize querying the page table by namespace - https://phabricator.wikimedia.org/T252122#10074092 (komla) a:komla
[15:58:55] Data-Services: Query is too slow ever since the migration to actor table - https://phabricator.wikimedia.org/T251801#10074093 (komla) a:komla
[16:12:06] Cloud-VPS (Debian Buster Deprecation), Humaniki: Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10074160 (Maximilianklein) update for 2024-08-15 [x] create cinder volume. [x] move project code [x] move mysql-db files [x] create a new debian bookworm inst...
[16:21:06] (PS1) Jean-Frédéric: Upgrade docker-compose files from legacy syntax [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1063853
[16:22:57] (CR) CI reject: [V:-1] Upgrade docker-compose files from legacy syntax [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1063853 (owner: Jean-Frédéric)
[16:29:41] RESOLVED: CloudVPSDesignateLeaks: Detected 12 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:33:30] Tool-schedule-deployment: schedule-deployment should escape `|` characters in change subjects included in template parameters - https://phabricator.wikimedia.org/T372750#10074376 (bd808)
[16:35:10] Tool-schedule-deployment: schedule-deployment should escape `|` characters in change subjects included in template parameters - https://phabricator.wikimedia.org/T372750#10074378 (bd808) p:Triage→Medium
[16:37:00] Tool-schedule-deployment: schedule-deployment should escape `|` characters in values included in template parameters - https://phabricator.wikimedia.org/T372750#10074384 (bd808)
[17:12:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:29:14] VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992#10074635 (bd808) Open→Resolved a:bd808 https://codesearch.wmcloud.org/wmcs/?q=mwclient&files=&excludeFiles=&repos=
[17:39:59] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "striker" project Buster deprecation - https://phabricator.wikimedia.org/T367555#10074679 (bd808) >>! In T367555#10070449, @Andrew wrote: > I created a new Bookworm VM and moved the cinder volume over. I expected to be able to launch striker with 'docker compo...
[17:41:11] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10074692 (VRiley-WMF) plugged the port in and also reseated management cable
[17:47:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:14:13] (PS1) Jean-Frédéric: Set file permissions on database configuration file in Docker build [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1063859
[18:15:25] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814 (Andrew) NEW
[18:23:36] (PS2) Jean-Frédéric: Upgrade docker-compose files from legacy syntax [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1063853
[18:31:28] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10074987 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[18:35:42] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10074990 (Andrew) All three of these need reimaging to get the drive labels set up properly; right now they all have a big OSD drive assigned to the os.
[18:39:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[18:42:04] PROBLEM - Host cloudcephosd1038 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:34] RECOVERY - Host cloudcephosd1038 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[18:47:12] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[18:47:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[18:50:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[19:00:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[19:20:41] cloud-services-team, Cloud-VPS: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once - https://phabricator.wikimedia.org/T372821 (Andrew) NEW
[19:30:15] VPS-project-Codesearch: Index https://gitlab.wikimedia.org/toolforge-repos/ repos - https://phabricator.wikimedia.org/T371992#10075229 (bd808)
[19:30:32] VPS-project-Codesearch: Index known popular MediaWiki client libraries - https://phabricator.wikimedia.org/T371993#10075231 (bd808)
[19:52:47] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10075330 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed...
[20:15:38] VPS-project-devtools, collaboration-services, GitLab, Release-Engineering-Team, Patch-For-Review: https://gitlab.devtools.wmcloud.org is being indexed by google (and scoring pretty high) - https://phabricator.wikimedia.org/T372538#10075462 (Dzahn) The gitlab test instance now has a robots.txt...
[20:29:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-24 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:58:08] Quarry, superset.wmcloud.org: Analysis and metrics collection for quarry and superset adoption - https://phabricator.wikimedia.org/T369150#10075628 (rook) Superset was deployed almost a year ago. I've made a graph of the daily queries in superset and quarry. The total queries between last September's lau...
[21:20:04] Tools, Infrastructure-Foundations: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman - https://phabricator.wikimedia.org/T371644#10075703 (KFrancis) FYI - a complete NDA is on file for Hal. Thanks all!
[21:39:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:46:30] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17,tools-k8s-worker-nfs-24
[21:46:34] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-17,tools-k8s-worker-nfs-24
[21:46:48] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17
[21:52:20] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17
[21:56:57] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24
[22:02:27] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24
[22:14:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:19:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:22:15] Toolforge (Toolforge iteration 14), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#10075763 (Raymond_Ndibe) Documentation can be found here https://wikitech.wikimedia.org/wiki/...
[22:24:36] Toolforge (Toolforge iteration 14), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#10075765 (Raymond_Ndibe) Open→In progress
[22:29:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:34:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:34:18] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:39:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:58:01] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10075894 (Andrew) These are now rebuilt with proper partitioning. They probably shouldn't be bootstrapped until T372821 is resolved.
[23:16:05] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[23:16:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[23:16:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[23:17:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[23:17:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[23:17:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[23:28:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[23:28:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)