[00:06:21] FIRING: [2x] PrometheusK8sCertExpirySoon: Prometheus k8s certificate is about to expire - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/PrometheusK8sCertExpirySoon - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPrometheusK8sCertExpirySoon
[00:36:28] FIRING: InstanceDown: Project cvn instance cvn-app13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:06:28] RESOLVED: InstanceDown: Project cvn instance cvn-app13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:12:28] FIRING: InstanceDown: Project cvn instance cvn-app13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:47:28] RESOLVED: InstanceDown: Project cvn instance cvn-app13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:51:28] FIRING: InstanceDown: Project cvn instance cvn-app13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:08:13] (open) raymond-ndibe: Draft: [components-api] add components-api conditional build and run tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/797 (https://phabricator.wikimedia.org/T389044)
[05:11:16] (update) raymond-ndibe: [components.deployment.create] add force-build and force-run option [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/33 (https://phabricator.wikimedia.org/T389044)
[05:14:42] (update) raymond-ndibe: [components.deployment.create] add force-build and force-run option [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/33 (https://phabricator.wikimedia.org/T389044)
[05:34:07] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10858110 (Marostegui) a:Marostegui Since there are no objections, I will remove sanitarium hosts and will reconvert sanitarium masters to normal replicas in codfw.
[06:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:08:21] (PS3) Slyngshede: Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806
[07:09:02] (CR) CI reject: [V:-1] Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806 (owner: Slyngshede)
[07:32:39] (PS4) Slyngshede: Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806
[07:33:10] (CR) CI reject: [V:-1] Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806 (owner: Slyngshede)
[07:34:41] (CR) Slyngshede: Build: Update build system (1 comment) [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806 (owner: Slyngshede)
[07:41:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudservices2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:46:48] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on cloudservices2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:14:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.937% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:36:56] cloud-services-team, Cloud-VPS, Upstream: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255#10858961 (taavi) a:Andrew→taavi I think this is a Neutron bug. https://review.opendev.org/c/openstack/neutron/+/942916, first included in N...
[11:51:01] cloud-services-team, Cloud-VPS, Upstream: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255#10858993 (taavi) a:taavi→None Just adding that call back doesn't quite work, my best guess is that's because `oslo.service` in Epoxy doesn'...
[12:35:44] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859189 (taavi) a:taavi
[12:37:09] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859196 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudlb2003-dev.codfw.wmnet with OS bookworm
[12:38:21] cloud-services-team (Hardware), Cloud-VPS: replace cloudlb2001-dev with cloudlb2004-dev - https://phabricator.wikimedia.org/T377126#10859202 (taavi) Open→Resolved
[12:44:34] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.943% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:57:13] cloud-services-team, Data-Services, DBA, Patch-For-Review: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859249 (ops-monitoring-bot) Upgrading db2191.codfw.wmnet
[12:57:23] cloud-services-team, Data-Services, DBA, Patch-For-Review: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859251 (ops-monitoring-bot) Completed depool of db2191 - Upgrading db2191.codfw.wmnet - marostegui@cumin1002
[13:02:49] cloud-services-team, Data-Services, DBA, Patch-For-Review: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859290 (ops-monitoring-bot) Upgrade of db2191.codfw.wmnet completed
[13:02:53] cloud-services-team, Data-Services, DBA, Patch-For-Review: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859293 (ops-monitoring-bot) Upgrade of db2191.codfw.wmnet completed
[13:13:28] FIRING: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[13:13:39] FIRING: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown
[13:22:02] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859353 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudlb2003-dev.codfw.wmnet with OS bookworm executed with errors: - cloudlb2003-dev (**FAIL...
[13:23:28] RESOLVED: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[13:23:39] RESOLVED: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown
[13:30:05] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859375 (taavi)
[13:33:13] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859385 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudlb2002-dev.codfw.wmnet with OS bookworm
[13:34:41] cloud-services-team, Cloud-VPS, Upstream: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255#10859391 (Andrew) Thank you for noticing and logging this! Do you know what user-facing symptoms result from this issue? Seems like we need to add...
[13:37:25] cloud-services-team, Cloud-VPS, Upstream: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255#10859393 (taavi) >>! In T395255#10859391, @Andrew wrote: > Thank you for noticing and logging this! Do you know what user-facing symptoms result fr...
[13:39:10] cloud-services-team, Toolforge: Is Using sentry for error monitoring against wikimedia cloud privacy policy? - https://phabricator.wikimedia.org/T394577#10859398 (Andrew) Yep, we rely on OSI to determined whether something is properly opensource or not, so this would not pass that test. However since you...
[13:46:57] cloud-services-team, Data-Services, DBA, Patch-For-Review: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859417 (ops-monitoring-bot) Started cloning db2191.codfw.wmnet to db2186.codfw.wmnet - marostegui@cumin1002
[14:03:41] (approved) chuckonwumelu: Remove toolsbeta-prometheus-1 volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/45 (owner: taavi)
[14:19:04] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859574 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudlb2002-dev.codfw.wmnet with OS bookworm completed: - cloudlb2002-dev (**PASS**) - Dow...
[14:19:51] cloud-services-team, Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10859575 (taavi)
[14:21:16] cloud-services-team, Toolforge: [infra] Reports of slow connectivity from APAC - https://phabricator.wikimedia.org/T395135#10859582 (Nokib_Sarkar) ## What is the output of http://test-ipv6.com/helpdesk/ ? ` Help desk code: 4 IPv4 Only IPv4: Good, AS23956 - AMBERIT-BD-AS AmberIT Limited IPv6: no IPv4 ad...
[14:44:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[14:49:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[15:09:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[15:14:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[15:14:20] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10859814 (ops-monitoring-bot) Start pool of db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning - marostegui@cumin1002
[15:19:22] (open) dcaro: reboot: added a script to restart containerd [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:20:35] (update) dcaro: reboot: added a script to restart containerd [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:22:23] (approved) dcaro: readme: add link to packaging docs [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/32
[15:22:26] (merge) dcaro: readme: add link to packaging docs [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/32
[15:23:25] (update) dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] (skip_build_if_refs_are_same) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[15:23:34] (update) dcaro: [deploy] skip build if refs are same [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[15:24:08] (update) dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] (skip_build_if_refs_are_same) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[15:24:26] (update) dcaro: [components.deployment.create] add force-build and force-run option [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/33 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[15:24:36] (update) dcaro: [components.deployment.create] add force-build and force-run option [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/33 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[15:28:23] (approved) fnegri: reboot: added a script to restart containerd [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245 (owner: dcaro)
[15:29:08] (merge) dcaro: reboot: added a script to restart containerd [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:29:09] (update) dcaro: reboot: added a script to restart containerd [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:30:08] (update) dcaro: [deploy] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) (owner: raymond-ndibe)
[15:30:47] (update) dcaro: [deploy] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) (owner: raymond-ndibe)
[15:31:15] (approved) dcaro: [toolforge_models] update toolforge_models.py [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/79 (owner: raymond-ndibe)
[15:31:16] (update) dcaro: [toolforge_models] update toolforge_models.py [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/79 (owner: raymond-ndibe)
[15:59:43] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860047 (ops-monitoring-bot) Completed pool of db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning - marostegui@cumin1002
[16:05:30] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860091 (Marostegui) db2186 has been converted into a x1 slave - given it till tomorrow to replicate before start to pool it for the first time.
[16:06:17] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860102 (ops-monitoring-bot) Upgrading db2186.codfw.wmnet
[16:14:36] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860185 (ops-monitoring-bot) Upgrade of db2186.codfw.wmnet completed
[16:14:42] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860188 (ops-monitoring-bot) Upgrade of db2186.codfw.wmnet completed
[16:16:17] cloud-services-team, Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10860198 (taavi)
[16:35:44] (update) chuckonwumelu: [api] Adding warning message for beta [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/78 (https://phabricator.wikimedia.org/T394277)
[16:38:27] cloud-services-team, Toolforge: Renew Prometheus K8s cert - https://phabricator.wikimedia.org/T395227#10860321 (dcaro) a:dcaro
[16:38:33] cloud-services-team, Toolforge: Renew Prometheus K8s cert - https://phabricator.wikimedia.org/T395227#10860322 (dcaro) Open→In progress
[16:53:29] cloud-services-team, Toolforge (Toolforge iteration 20), Patch-For-Review: Renew Prometheus K8s cert - https://phabricator.wikimedia.org/T395227#10860430 (dcaro) Alerts are gone, all the prometheus targets are up (after a puppet run that triggered a restart of prometheus), flagging it as solved :)
[16:53:38] cloud-services-team, Toolforge (Toolforge iteration 20), Patch-For-Review: [k8s,infra] Renew Prometheus K8s cert - https://phabricator.wikimedia.org/T395227#10860444 (dcaro)
[16:53:41] cloud-services-team, Toolforge (Toolforge iteration 20), Patch-For-Review: [k8s,infra] Renew Prometheus K8s cert - https://phabricator.wikimedia.org/T395227#10860446 (dcaro) In progress→Resolved
[16:58:06] Tool-yearinreview, Indic MediaWiki Developers UG, Indic-TechCom: it should be a day instead of edits in yearinreview tool - https://phabricator.wikimedia.org/T377669#10860466 (Gopavasanth) a:marrivs
[16:58:38] (update) dcaro: Draft: [components-api] add components-api conditional build and run tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/797 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[17:04:19] (update) dcaro: [deploy] skip build if refs are same [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) (owner: raymond-ndibe)
[17:52:52] Data-Services, Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10860849 (Milimetric) @fnegri just to help with prioritization, when do you need us to sign off on this?
[18:05:16] cloud-services-team, GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥): Recent incidents of buildkitd's storage volume filling up - https://phabricator.wikimedia.org/T395097#10860917 (thcipriani) >>! In T395097#10859698, @dancy wrote: > I think this is related to the `publish-b...
[18:10:32] cloud-services-team, GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥): Recent incidents of buildkitd's storage volume filling up - https://phabricator.wikimedia.org/T395097#10860942 (dancy) >>! In T395097#10860917, @thcipriani wrote: >>>! In T395097#10859698, @dancy wrote: >>...
[18:13:33] cloud-services-team, GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥): Recent incidents of buildkitd's storage volume filling up - https://phabricator.wikimedia.org/T395097#10860953 (dancy) Open→In progress p:Triage→Low a:dancy
[22:19:41] Data-Services, DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10861590 (Aklapper)