[00:06:56] <jinxer-wm>	 FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:38:04] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[01:01:56] <jinxer-wm>	 RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:08:56] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:49:27] <wikibugs>	 (03update) 10raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[02:52:24] <wikibugs>	 (03open) 10raymond-ndibe: Draft: [maintain-harbor] add tests and configurations for new maintain-harbor jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/881 (https://phabricator.wikimedia.org/T360509)
[02:52:48] <wikibugs>	 (03update) 10raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[03:36:28] <wikibugs>	 (03open) 10bodhisattwa: Edit ta.json [toolforge-repos/sangkalak] - 10https://gitlab.wikimedia.org/toolforge-repos/sangkalak/-/merge_requests/1
[05:08:56] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:07:50] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11022664 (10YochayCO) I believe disk-level backups are enough :)  Tomorrow it is then. Updating in this task is fine. If you think I'll be required wh...
[07:39:09] <wikibugs>	 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 31 slow ops - https://phabricator.wikimedia.org/T400104#11022710 (10dcaro) 05Open→03Resolved a:03dcaro This was us testing with cloudcephosd1006 for {T399870}.
[07:39:35] <wikibugs>	 06cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T400097#11022716 (10dcaro) 05Open→03Resolved a:03dcaro This was us testing cloudcephosd1006 for {T399870}
[07:44:28] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399870)
[07:44:36] <stashbot>	 T399870: Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870
[07:45:19] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399870)
[07:51:04] <wikibugs>	 (03PS1) 10David Caro: depool_and_destroy: only zap devices if there were any [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1171540
[07:53:35] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudcephosd1006....
[07:54:23] <wikibugs>	 10wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11022746 (10Arendpieter) Do I understand correctly that this issue can be closed now that Wikitech has become a SUL wiki?
[07:55:43] <wikibugs>	 (03update) 10dcaro: health-check: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114 (https://phabricator.wikimedia.org/T400025) (owner: 10damian)
[08:17:32] <wikibugs>	 (03update) 10dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[08:19:12] <wikibugs>	 06cloud-services-team, 10Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11022786 (10dcaro) While reviewing/testing the MRs, I was thinking about this feature, keeping in mind that:  * the goal is to support UDP ports * all tcp ports should have a probe (http and/or tcp)...
[08:33:50] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1006.eqia...
[08:36:00] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06SRE-OnFire, 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudceph...
[08:38:41] <wikibugs>	 (03approved) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[08:38:44] <wikibugs>	 (03merge) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[08:48:23] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06SRE-OnFire, 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022860 (10fnegri) cloudcephosd1006 was reimaged again on 2025-07-21, but this time //without// keeping the data.  Th...
[08:56:48] <wikibugs>	 (03update) 10dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[08:57:18] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11022867 (10fnegri) > cloudcephosd1006 now has a full rebuild of all OSDs and is running pacific and bookwor...
[09:55:27] <wikibugs>	 (03update) 10dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[09:55:55] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06SRE-OnFire, 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11023053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1...
[09:56:13] <wikibugs>	 (03update) 10dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[10:02:05] <wikibugs>	 (03merge) 10dcaro: foxtrot_ldap: move install into foxtrot_ldap role [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/238
[10:03:21] <wm-bot2>	 !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:03:24] <wm-bot2>	 !log dcaro@hephaestus admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[10:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:04:06] <wm-bot2>	 !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:04:20] <wm-bot2>	 !log dcaro@hephaestus admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[10:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:06:43] <wm-bot2>	 !log dcaro@hephaestus admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[10:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:11:09] <jinxer-wm>	 FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[10:30:51] <wikibugs>	 (03open) 10dcaro: api: allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/186
[10:31:39] <wikibugs>	 (03close) 10dcaro: health-check: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/114 (https://phabricator.wikimedia.org/T400025) (owner: 10damian)
[10:32:07] <wikibugs>	 (03update) 10dcaro: api: Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[10:40:40] <wikibugs>	 (03CR) 10David Caro: [C:04-1] "I think this is not really what we want, sometimes we want to zap them anyhow, rethinking..." [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1171540 (owner: 10David Caro)
[10:51:09] <jinxer-wm>	 RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:05:58] <wikibugs>	 06cloud-services-team, 10Toolforge: Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023362 (10DamianZaremba) Agreed, if the fallback is implicit then this can be simplified. To support UDP then this change would just need to be along the lines of ` diff --git a/tjf/runtimes/k8s/he...
[11:26:16] <wikibugs>	 (03update) 10dcaro: api: fix default probe [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185
[11:32:18] <jinxer-wm>	 FIRING: KernelErrors: Server cloudvirt1047 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[11:32:29] <wikibugs>	 06cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134 (10phaultfinder) 03NEW
[11:35:09] <jinxer-wm>	 FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:37:17] <jinxer-wm>	 FIRING: [2x] KernelErrors: Server cloudcephosd1024 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors  - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[11:37:25] <wikibugs>	 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136 (10phaultfinder) 03NEW
[11:39:31] <wikibugs>	 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136#11023461 (10dcaro) 05Open→03Resolved a:03dcaro We live-moved a cable to a different switch
[11:39:50] <wikibugs>	 06cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134#11023467 (10dcaro) 05Open→03Resolved a:03dcaro We live-moved a cable to a different switch
[11:40:13] <wikibugs>	 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T400136#11023471 (10dcaro) {T334644}
[11:40:19] <wikibugs>	 06cloud-services-team: KernelErrors Server cloudvirt1047 logged kernel errors - https://phabricator.wikimedia.org/T400134#11023475 (10dcaro) {T334644}
[11:44:05] <wikibugs>	 06cloud-services-team, 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023481 (10Jclark-ctr) a:03Jclark-ctr
[11:45:34] <wikibugs>	 06cloud-services-team, 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023482 (10ayounsi) 05Open→03Resolved All done, thanks a lot!
[11:50:50] <wmcs-alerts>	 FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:52:35] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023534 (10dcaro)
[11:52:45] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023539 (10dcaro) p:05Triage→03Medium
[11:52:49] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023541 (10dcaro) a:03dcaro
[11:52:53] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for TCP health checking - https://phabricator.wikimedia.org/T400025#11023543 (10dcaro) 05Open→03In progress
[11:53:07] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024#11023545 (10dcaro) p:05Triage→03Medium a:03dcaro
[11:53:23] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): Support for UDP ports in jobs - https://phabricator.wikimedia.org/T400024#11023549 (10dcaro) 05Open→03In progress
[11:55:50] <wmcs-alerts>	 RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:58:53] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#11023569 (10dcaro) 05In progress→03Resolved This is done :) (thanks @Andrew!) ` root@cloudcephosd1006:~...
[12:03:39] <jinxer-wm>	 RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:10:28] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22): [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#11023622 (10dcaro) 05Open→03In progress
[12:12:08] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22), 07Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023647 (10dcaro) a:03dcaro
[12:25:05] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22), 07Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023810 (10dcaro) I think that the issue is the quota: ` tools.multichill@tools-bastion-...
[12:29:24] <wikibugs>	 (03open) 10dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:34:47] <wikibugs>	 10wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11023882 (10Leaderboard) 05Open→03Resolved a:03Leaderboard It seems to have been already adjusted indeed, so closing this one.
[12:34:56] <wikibugs>	 10wikitech.wikimedia.org: Adjusting protection messages on Wikitech - https://phabricator.wikimedia.org/T286859#11023889 (10Leaderboard) a:05Leaderboard→03None
[12:36:35] <wikibugs>	 (03update) 10dcaro: [jobs-api] check services diff [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/158 (https://phabricator.wikimedia.org/T392717) (owner: 10raymond-ndibe)
[12:36:50] <wikibugs>	 (03update) 10dcaro: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[12:44:54] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22), 07Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023938 (10dcaro) @Multichill can you try now? I extended your job quota a bit, you shou...
[12:45:22] <wikibugs>	 (03approved) 10dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:45:30] <wikibugs>	 (03merge) 10dcaro: maintain-kubeusers: extend multichill cronjob quota [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882
[12:47:59] <wikibugs>	 (03update) 10dcaro: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:51:42] <wikibugs>	 06cloud-services-team, 10Toolforge (Toolforge iteration 22), 07Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11023952 (10dcaro) 05Open→03In progress
[12:51:59] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023953 (10dcaro) p:05Triage→03High
[12:52:12] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023956 (10dcaro)
[12:52:51] <wikibugs>	 06cloud-services-team, 10Toolforge, 13Patch-For-Review, 07Privacy: tools-static.wmflabs.org/cdnjs may return redirects to speedcf.cloudflareaccess.com, violating user privacy - https://phabricator.wikimedia.org/T399483#11023958 (10dcaro) p:05Triage→03High
[12:54:17] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023963 (10taavi) > Ideally, we would have probes tracking a number of tools, and we could page when the per...
[13:00:26] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023985 (10fnegri) > Instead of probes, what about measuring the percentage or rate of 5xx errors returned f...
[13:08:21] <wikibugs>	 (03merge) 10mahir256: Edit ta.json [toolforge-repos/sangkalak] - 10https://gitlab.wikimedia.org/toolforge-repos/sangkalak/-/merge_requests/1 (owner: 10bodhisattwa)
[13:09:52] <wikibugs>	 (03approved) 10raymond-ndibe: cli: only send fields that are set [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/112 (owner: 10dcaro)
[13:10:39] <wikibugs>	 (03unapproved) 10raymond-ndibe: cli: only send fields that are set [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/112 (owner: 10dcaro)
[13:25:46] <wikibugs>	 10Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11024106 (10SD0001) >>! In T399418#11018955, @Novem_Linguae wrote: > I think the data ingested by the above ToolForge tool would use the replicas and **just ingest user_name and user_editcount**...
[13:26:55] <wikibugs>	 10Cloud-VPS (Project-requests): Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11024113 (10taavi) +1
[13:41:53] <wikibugs>	 06cloud-services-team, 10Toolforge, 06SRE-OnFire, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11024157 (10taavi) Not at the moment, I think. I see two options for collecting that: using [[ https://github...
[14:02:26] <wikibugs>	 (03open) 10raymond-ndibe: Draft: test [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/187
[14:20:47] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163 (10Andrew) 03NEW
[14:20:52] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024265 (10Andrew) p:05Triage→03High
[14:21:55] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024270 (10Andrew)
[14:26:51] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[14:32:18] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024333 (10Andrew) Sounds like we don't actually need a full rebuild:    •taavi> Taavi Väänänen so there's a workaround of setting `authdns_servers:` to v4 addresses if needed
[14:35:29] <wikibugs>	 10Toolforge (Toolforge iteration 22): [foxtrot-ldap] publish image in harbor repos - https://phabricator.wikimedia.org/T400167 (10dcaro) 03NEW
[14:36:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.upgrade_osds (exit_code=97)
[14:36:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:36:25] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[14:40:32] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:40:39] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[14:40:59] <wikibugs>	 (03open) 10vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - 10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1
[14:42:12] <wikibugs>	 10Toolforge (Toolforge iteration 22): [foxtrot-ldap] publish image in harbor repos - https://phabricator.wikimedia.org/T400167#11024403 (10dcaro) p:05Triage→03Low
[14:42:58] <wikibugs>	 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06SRE-OnFire, 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11024407 (10Andrew) Our issue resembles this upstream report: https://serverfault.com/questions/1172161/osds-stability...
[14:59:24] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024472 (10dcaro) Adding a note here to not forget, we might want to monitor acme-chief failures (ex. if it failed to renew a cert for 2 days in a row)
[15:23:41] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11024622 (10dcaro)
[15:51:41] <wikibugs>	 (03update) 10dcaro: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:08:57] <wikibugs>	 (03merge) 10dcaro: runtimes.k8s.images: use config for image refresh interval [repos/cloud/toolforge/jobs-api] (refresh_image_config_data) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/165
[16:09:01] <wikibugs>	 (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe)
[16:11:49] <wikibugs>	 10Toolforge (Toolforge iteration 22): [components-api] store the config used for the deployment in the deployment themselves - https://phabricator.wikimedia.org/T400064#11024910 (10dcaro) p:05Triage→03High
[16:15:02] <wikibugs>	 06cloud-services-team, 10wikitech.wikimedia.org, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11024921 (10BTullis) Ah, this pod isn't starting there is an error due to the permissions of the script. ` State:          Terminated...
[16:23:47] <wikibugs>	 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024974 (10fnegri) > I think it's fine to increase the quota to 100K or even 200K objects, but I'd like to understand why the curr...
[16:23:57] <wikibugs>	 06cloud-services-team, 10wikitech.wikimedia.org, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11024976 (10BTullis) I have synced the last few days' worth manually again. ` dumpsgen@dumpsdata1003:/data/othe...
[16:26:19] <wikibugs>	 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024984 (10bd808) +1 for implementing
[16:26:47] <wikibugs>	 06cloud-services-team, 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11024987 (10bd808)
[16:39:55] <wikibugs>	 06cloud-services-team, 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11025026 (10fnegri) 05Open→03Resolved I increased the current quotas to 100,000 objects and 75GB (...
[17:05:02] <wikibugs>	 06cloud-services-team, 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746#11025149 (10Scardenasmolinar) Thank you for the quick turnaround!
[18:19:46] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[18:37:09] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99)
[19:00:06] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[19:06:26] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_osds (exit_code=0)
[19:09:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds
[19:15:54] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_osds (exit_code=0)
[19:15:54] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163#11025735 (10cmooney) >>! In T400163#11024333, @Andrew wrote: > Sounds like we don't actually need a full rebuild: >  > •taavi> Taavi Väänänen so there's a workaround of setting `authdns_server...
[19:40:20] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_mons
[19:45:10] <wmcs-alerts>	 FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (4d 17h 37m 52s to expiration)   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[19:51:05] <wikibugs>	 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025824 (10VRiley-WMF) a:03VRiley-WMF
[19:51:48] <wikibugs>	 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025826 (10VRiley-WMF)
[19:54:14] <wikibugs>	 10Tool-extjsonuploader: extjsonuploader seems to have stopped updating ExtensionJson - https://phabricator.wikimedia.org/T400044#11025830 (10A_smart_kitten) 05Open→03Resolved Looks like this may have resolved itself - https://www.mediawiki.org/wiki/Special:Contributions/Bawolff_bot shows two updates to `...
[19:56:11] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_mons (exit_code=0)
[20:00:12] <wikibugs>	 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025850 (10VRiley-WMF) clouddb1022 Rack A4 U33 CableID: 20220030 Port: 43  clouddb1023 Rack B2 U32 CableID: 5253 Port: 39  clouddb1024 Rack E8 U35  clouddb1025...
[20:06:23] <wikibugs>	 06cloud-services-team, 10Toolforge, 06serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212 (10PerfektesChaos) 03NEW
[20:24:04] <wikibugs>	 06cloud-services-team, 10Toolforge, 06serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025942 (10bd808) Possible duplicate of {T226688}.
[20:33:25] <wikibugs>	 06cloud-services-team, 10Toolforge, 06serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025977 (10bd808) >* In T393487#11024836 it is claimed that “these wikis all have robots.txt files that tell all crawlers to ignore the sites”. >** Well...
[20:34:13] <wikibugs>	 06cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T399144#11025982 (10Andrew) 05Open→03Resolved a:03Andrew
[20:34:15] <wikibugs>	 06cloud-services-team, 10Toolforge, 06serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11025981 (10bd808) > On the other hand, the IP blocking at BETA should be terminated as soon as possible. IP ranges are not a good idea to distinguish bo...
[20:34:37] <wikibugs>	 06cloud-services-team, 10Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127#11025984 (10Andrew) p:05Triage→03Medium
[20:48:56] <wmcs-alerts>	 FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[20:49:28] <wmcs-alerts>	 FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[20:55:26] <wikibugs>	 06cloud-services-team, 10Toolforge, 06serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026003 (10Wurgl) libwww-perl/6.68 is the record holder (3 times more visits than PetalBot). But PetalBot was the one who found really time-consuming u...
[21:13:55] <wmcs-alerts>	 RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[21:14:28] <wmcs-alerts>	 RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[23:04:17] <wikibugs>	 (03update) 10raymond-ndibe: Draft: [maintain-harbor] add tests and configurations for new maintain-harbor jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/881 (https://phabricator.wikimedia.org/T360509)
[23:05:43] <wikibugs>	 (03update) 10raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[23:39:28] <wmcs-alerts>	 FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.06%) at host paws-nfs-1 in project paws   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull