[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:38:04] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:08:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:49:27] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[02:52:24] (open) raymond-ndibe: Draft: [maintain-harbor] add tests and configurations for new maintain-harbor jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/881 (https://phabricator.wikimedia.org/T360509)
[02:52:48] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[03:36:28] (open) bodhisattwa: Edit ta.json [toolforge-repos/sangkalak] - https://gitlab.wikimedia.org/toolforge-repos/sangkalak/-/merge_requests/1
[05:08:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:07:50] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11022664 (YochayCO) I believe disk-level backups are enough :) Tomorrow it is then. Updating in this task is fine. If you think I'll be required wh...
[07:39:09] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 31 slow ops - https://phabricator.wikimedia.org/T400104#11022710 (dcaro) Open→Resolved a:dcaro This was us testing with coludcephosd1006 for {T399870}.
[07:39:35] cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T400097#11022716 (dcaro) Open→Resolved a:dcaro This was us testing cloudcephosd1006 for {T399870}
[07:44:28] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399870)
[07:44:36] T399870: Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870
[07:45:19] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399870)
[07:51:04] (PS1) David Caro: depool_and_destroy: only zap devices if there were any [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1171540
[07:53:35] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudcephosd1006....
[07:54:23] wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11022746 (Arendpieter) Do I understand correctly that this issue can be closed now that Wikitech has become a SUL wiki?
[07:55:43] (update) dcaro: health-check: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114 (https://phabricator.wikimedia.org/T400025) (owner: damian)
[08:17:32] (update) dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[08:19:12] cloud-services-team, Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11022786 (dcaro) While reviewing/testing the MRs, I was thinking about this feature, keeping in mind that: * the goal is to support UDP ports * all tcp ports should have a probe (http and/or tcp)...
[08:33:50] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022828 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1006.eqia...
[08:36:00] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022829 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudceph...
[08:38:41] (approved) dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[08:38:44] (merge) dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[08:48:23] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022860 (fnegri) cloudcephosd1006 was reimaged again on 2025-07-21, but this time //without// keeping the data. Th...
[08:56:48] (update) dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[08:57:18] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11022867 (fnegri) > cloudcephosd1006 now has a full rebuild of all OSDs and is running pacific and bookwor...
[09:55:27] (update) dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[09:55:55] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11023053 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1...
[09:56:13] (update) dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[10:02:05] (merge) dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[10:03:21] !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:03:24] !log dcaro@hephaestus admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[10:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:04:06] !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:04:20] !log dcaro@hephaestus admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[10:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:06:43] !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:11:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[10:30:51] (open) dcaro: api: allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/186
[10:31:39] (close) dcaro: health-check: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114 (https://phabricator.wikimedia.org/T400025) (owner: damian)
[10:32:07] (update) dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[10:40:40] (CR) David Caro: [C:-1] "I think this is not really what we want, sometimes we want to zap them anyhow, rethinking..." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1171540 (owner: David Caro)
[10:51:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:05:58] cloud-services-team, Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023362 (DamianZaremba) Agreed, if the fallback is implicit then this can be simplified. To support UDP then this change would just need to be along the lines of ` diff --git a/tjf/runtimes/k8s/he...
[11:26:16] (update) dcaro: api: fix default probe [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[11:32:18] FIRING: KernelErrors: Server cloudvirt1047 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[11:32:29] cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134 (phaultfinder) NEW
[11:35:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:37:17] FIRING: [2x] KernelErrors: Server cloudcephosd1024 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[11:37:25] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136 (phaultfinder) NEW
[11:39:31] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136#11023461 (dcaro) Open→Resolved a:dcaro We live-moved a cable to a different switch
[11:39:50] cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134#11023467 (dcaro) Open→Resolved a:dcaro We live-moved a cable to a different switch
[11:40:13] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136#11023471 (dcaro) {T334644}
[11:40:19] cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134#11023475 (dcaro) {T334644}
[11:44:05] cloud-services-team, Cloud-VPS, DC-Ops, ops-eqiad, SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023481 (Jclark-ctr) a:Jclark-ctr
[11:45:34] cloud-services-team, Cloud-VPS, DC-Ops, ops-eqiad, SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023482 (ayounsi) Open→Resolved All done, thanks a lot!
[11:50:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:52:35] cloud-services-team, Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023534 (dcaro)
[11:52:45] cloud-services-team, Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023539 (dcaro) p:Triage→Medium
[11:52:49] cloud-services-team, Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023541 (dcaro) a:dcaro
[11:52:53] cloud-services-team, Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023543 (dcaro) Open→In progress
[11:53:07] cloud-services-team, Toolforge (Toolforge iteration 22): Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024#11023545 (dcaro) p:Triage→Medium a:dcaro
[11:53:23] cloud-services-team, Toolforge (Toolforge iteration 22): Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024#11023549 (dcaro) Open→In progress
[11:55:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:58:53] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#11023569 (dcaro) In progress→Resolved This is done :) (thanks @Andrew!) ` root@cloudcephosd1006:~...
[12:03:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:10:28] cloud-services-team, Toolforge (Toolforge iteration 22): [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#11023622 (dcaro) Open→In progress
[12:12:08] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023647 (dcaro) a:dcaro
[12:25:05] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023810 (dcaro) I think that the issue is the quota: ` tools.multichill@tools-bastion-...
[12:29:24] (open) dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:34:47] wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11023882 (Leaderboard) Open→Resolved a:Leaderboard It seems to have been already adjusted indeed, so closing this one.
[12:34:56] wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11023889 (Leaderboard) a:Leaderboard→None
[12:36:35] (update) dcaro: [jobs-api] check services diff [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/158 (https://phabricator.wikimedia.org/T392717) (owner: raymond-ndibe)
[12:36:50] (update) dcaro: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[12:44:54] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023938 (dcaro) @Multichill can you try now? I extended your job quota a bit, you shou...
[12:45:22] (approved) dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:45:30] (merge) dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:47:59] (update) dcaro: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:51:42] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023952 (dcaro) Open→In progress
[12:51:59] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023953 (dcaro) p:Triage→High
[12:52:12] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023956 (dcaro)
[12:52:51] cloud-services-team, Toolforge, Patch-For-Review, Privacy: tools-static.wmflabs.org/cdnjs may return redirects to speedcf.cloudflareaccess.com, violating user privacy - https://phabricator.wikimedia.org/T399483#11023958 (dcaro) p:Triage→High
[12:54:17] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023963 (taavi) > Ideally, we would have probes tracking a number of tools, and we could page when the per...
[13:00:26] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023985 (fnegri) > Instead of probes, what about measuring the percentage or rate of 5xx errors returned f...
[13:08:21] (merge) mahir256: Edit ta.json [toolforge-repos/sangkalak] - https://gitlab.wikimedia.org/toolforge-repos/sangkalak/-/merge_requests/1 (owner: bodhisattwa)
[13:09:52] (approved) raymond-ndibe: cli: only send fields that are set [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/112 (owner: dcaro)
[13:10:39] (unapproved) raymond-ndibe: cli: only send fields that are set [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/112 (owner: dcaro)
[13:25:46] Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11024106 (SD0001) >>! In T399418#11018955, @Novem_Linguae wrote: > I think the data ingested by the above ToolForge tool would use the replicas and **just ingest user_name and user_editcount**...
[13:26:55] Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11024113 (taavi) +1
[13:41:53] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11024157 (taavi) Not at the moment, I think. I see two options for collecting that: using [[ https://github...
[14:02:26] (open) raymond-ndibe: Draft: test [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/187
[14:20:47] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163 (Andrew) NEW
[14:20:52] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024265 (Andrew) p:Triage→High
[14:21:55] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024270 (Andrew)
[14:26:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[14:32:18] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024333 (Andrew) Sounds like we don't actually need a full rebuild: •taavi> Taavi Väänänen so there's a workaround of setting `authdns_servers:` to v4 addresses if needed
[14:35:29] Toolforge (Toolforge iteration 22): [foxtrot-ldap] publish image in harbor repos - https://phabricator.wikimedia.org/T400167 (dcaro) NEW
[14:36:15] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.upgrade_osds (exit_code=97)
[14:36:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:36:25] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[14:40:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:40:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[14:40:59] (open) vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1
[14:42:12] Toolforge (Toolforge iteration 22): [foxtrot-ldap] publish image in harbor repos - https://phabricator.wikimedia.org/T400167#11024403 (dcaro) p:Triage→Low
[14:42:58] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11024407 (Andrew) Our issue resembles this upstream report: https://serverfault.com/questions/1172161/osds-stability...
[14:59:24] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024472 (dcaro) Adding a note here to not forget, we might want to monitor acme-chief failures (ex. if it failed to renew a cert for 2 days in a row)
[15:23:41] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024622 (dcaro)
[15:51:41] (update) dcaro: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:08:57] (merge) dcaro: runtimes.k8s.images: use config for image refresh interval [repos/cloud/toolforge/jobs-api] (refresh_image_config_data) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/165
[16:09:01] (update) dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: raymond-ndibe)
[16:11:49] Toolforge (Toolforge iteration 22): [components-api] store the config used for the deployment in the deployment themselves - https://phabricator.wikimedia.org/T400064#11024910 (dcaro) p:Triage→High
[16:15:02] cloud-services-team, wikitech.wikimedia.org, Data-Platform-SRE (2025.07.05 - 2025.07.25): wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11024921 (BTullis) Ah, this pod isn't starting there is an error due to the permissions of the script. ` State: Terminated...
[16:23:47] Cloud-VPS (Quota-requests), Moderator-Tools-Team, Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024974 (fnegri) > I think it's fine to increase the quota to 100K or even 200K objects, but I'd like to understand why the curr...
[16:23:57] cloud-services-team, wikitech.wikimedia.org, Data-Platform-SRE (2025.07.05 - 2025.07.25), Patch-For-Review: wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11024976 (BTullis) I have synced the last few days' worth manually again. ` dumpsgen@dumpsdata1003:/data/othe...
[16:26:19] Cloud-VPS (Quota-requests), Moderator-Tools-Team, Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024984 (bd808) +1 for implementing
[16:26:47] cloud-services-team, Cloud-VPS (Quota-requests), Moderator-Tools-Team, Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024987 (bd808)
[16:39:55] cloud-services-team, Cloud-VPS (Quota-requests), Moderator-Tools-Team, Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11025026 (fnegri) Open→Resolved I increased the current quotas to 100,000 objects and 75GB (...
[17:05:02] cloud-services-team, Cloud-VPS (Quota-requests), Moderator-Tools-Team, Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11025149 (Scardenasmolinar) Thank you for the quick turnaround!
[18:19:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[18:37:09] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99)
[19:00:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[19:06:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_osds (exit_code=0)
[19:09:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[19:15:54] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_osds (exit_code=0)
[19:15:54] cloud-services-team, Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11025735 (cmooney) >>! In T400163#11024333, @Andrew wrote: > Sounds like we don't actually need a full rebuild: > > •taavi> Taavi Väänänen so there's a workaround of setting `authdns_server...
[19:40:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_mons
[19:45:10] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (4d 17h 37m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[19:51:05] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025824 (VRiley-WMF) a:VRiley-WMF
[19:51:48] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025826 (VRiley-WMF)
[19:54:14] Tool-extjsonuploader: extjsonuploader seems to have stopped updating ExtensionJson - https://phabricator.wikimedia.org/T400044#11025830 (A_smart_kitten) Open→Resolved Looks like this may have resolved itself - https://www.mediawiki.org/wiki/Special:Contributions/Bawolff_bot shows two updates to `...
[19:56:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_mons (exit_code=0)
[20:00:12] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025850 (VRiley-WMF) clouddb1022 Rack A4 U33 CableID: 20220030 Port: 43 clouddb1023 Rack B2 U32 CableID: 5253 Port: 39 clouddb1024 Rack E8 U35 clouddb1025...
[20:06:23] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212 (PerfektesChaos) NEW
[20:24:04] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025942 (bd808) Possible duplicate of {T226688}.
[20:33:25] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025977 (bd808) >* In T393487#11024836 it is claimed that “these wikis all have robots.txt files that tell all crawlers to ignore the sites”. >** Well...
[20:34:13] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T399144#11025982 (Andrew) Open→Resolved a:Andrew
[20:34:15] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025981 (bd808) > On the other hand, the IP blocking at BETA should be terminated as soon as possible. IP ranges are not a good idea to distinguish bo...
[20:34:37] cloud-services-team, Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127#11025984 (Andrew) p:Triage→Medium
[20:48:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[20:49:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[20:55:26] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026003 (Wurgl) libwww-perl/6.68 is the record holder (3 times more visits that PetalBot). But PetalBot was the one, who found really time-consuming u...
[21:13:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[21:14:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[23:04:17] (update) raymond-ndibe: Draft: [maintain-harbor] add tests and configurations for new maintain-harbor jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/881 (https://phabricator.wikimedia.org/T360509)
[23:05:43] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[23:39:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.06%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull