[01:06:11] FIRING: SystemdUnitDown: The systemd unit hdfs_rsync_mediawiki_content_history.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:55:52] Tool-campwiz-nxt: CampWiz Nxt Redesign: Root Path - https://phabricator.wikimedia.org/T415408#11695632 (Remy_Christophe) Hello @Nokib_Sarkar @Tiven2240, I’ve just submitted a pull request for the migration of the Submission section from Next.js to Vite/React: Migrated pages: SubmissionListPage, EvaluationL...
[05:06:11] FIRING: SystemdUnitDown: The systemd unit hdfs_rsync_mediawiki_content_history.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:00:57] RESOLVED: SystemdUnitDown: The systemd unit hdfs_rsync_mediawiki_content_history.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:59:18] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Controlled cloudsw down tests for D5 - https://phabricator.wikimedia.org/T419656 (fgiunchedi) NEW
[08:02:37] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Controlled cloudsw down tests for D5 - https://phabricator.wikimedia.org/T419656#11696152 (fgiunchedi)
[08:06:45] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Controlled cloudsw down tests for E4 - https://phabricator.wikimedia.org/T419657 (fgiunchedi) NEW
[08:08:18] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Controlled cloudsw down tests for E4 - https://phabricator.wikimedia.org/T419657#11696168 (JJMC89)
[08:08:34] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Controlled cloudsw down tests for F4 - https://phabricator.wikimedia.org/T419658 (fgiunchedi) NEW
[08:18:05] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393#11696203 (fgiunchedi) Plan is to grab another announced maint window on Tues March 17th to resume the testing. I have also opened subtasks for the remai...
[08:25:28] (update) raymond-ndibe: [status] make job status an enum, with clearly defined states [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/208 (https://phabricator.wikimedia.org/T401172)
[08:49:26] Tools, All-and-every-Wikisource, Community-Tech, Wikimedia OCR: Wikisource OCR UI supplies a non-standard thumbnail size to the OCR tool hosted on the cloud - https://phabricator.wikimedia.org/T419246#11696296 (ShakespeareFan00) There is also the consideration, that via a user script, I can utili...
[09:10:49] cloud-services-team, Toolforge: [components-api] bump the openapi version on every change - https://phabricator.wikimedia.org/T401374#11696342 (Raymond_Ndibe) not sure if we should do this. We had a decision request and the final decision was to manually handle version bumping https://wikitech.wikimedia....
[09:31:40] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge (Toolforge iteration 26), Patch-For-Review: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#11696394 (dcaro) I propose having a process like: - Regular operation:...
[10:00:21] cloud-services-team, Toolforge: [components-api] bump the openapi version on every change - https://phabricator.wikimedia.org/T401374#11696501 (dcaro) Yep, I think this wolud the equivalent for components-api of the jobs-api pre-commit https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/ma...
[10:01:14] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge (Toolforge iteration 26), Patch-For-Review: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#11696505 (fnegri) > I propose having a process like: Looks good to me!
[10:06:38] Tool-delintbot: Mention what was fixed in the edit summary - https://phabricator.wikimedia.org/T416068#11696533 (Redmin)
[10:07:21] Tool-delintbot: Handle cases of extension tags with template parameters - https://phabricator.wikimedia.org/T416008#11696535 (Redmin)
[10:09:10] Tool-delintbot: Add ability to fetch list of pages with lint errors using the API - https://phabricator.wikimedia.org/T418577#11696540 (Redmin)
[10:19:32] Tool-delintbot: Fix cases of tags not being closed correctly - https://phabricator.wikimedia.org/T417483#11696559 (Redmin) Hi @Kavaljeet_Singh, any update? :)
[10:26:28] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: ToolforgeKubernetesCapacity alert actionability - https://phabricator.wikimedia.org/T419674 (fgiunchedi) NEW
[10:59:06] (PS1) Arendpieter: Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554)
[11:01:37] (CR) CI reject: [V:-1] Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554) (owner: Arendpieter)
[11:05:14] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: ToolforgeKubernetesCapacity alert actionability - https://phabricator.wikimedia.org/T419674#11696753 (dcaro) Related {T404726}
[11:07:02] (PS2) Arendpieter: Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554)
[11:07:29] cloud-services-team, Striker, CAS-SSO, Patch-For-Review: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#11696763 (Arendpieter) @taavi [[https://gerrit.wikimedia.org/r/c/labs/striker/+/1250537 | This is the second attempt]], where I made several different choices...
[11:19:52] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: ToolforgeKubernetesCapacity alert actionability - https://phabricator.wikimedia.org/T419674#11696855 (dcaro) Memory is tricky in that if your job hits the limit, or the host has no more free memory (if we overcommited and more than one job uses more tha...
[11:25:13] cloud-services-team, Tool-spacemedia, Toolforge: [Build service] latest builder has old Java - https://phabricator.wikimedia.org/T405415#11696901 (dcaro) >>! In T405415#11694405, @Don-vip wrote: > @dcaro is this update live for us? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-depl...
[11:47:46] (PS3) Arendpieter: Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554)
[11:48:04] Tool-wiktlexbot: Create a weekly cron job - https://phabricator.wikimedia.org/T419686 (Redmin) NEW
[11:49:13] (CR) CI reject: [V:-1] Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554) (owner: Arendpieter)
[11:52:38] (PS4) Arendpieter: Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1250537 (https://phabricator.wikimedia.org/T359554)
[12:36:57] Tools, All-and-every-Wikisource, Community-Tech, Wikimedia OCR: Wikisource OCR UI supplies a non-standard thumbnail size to the OCR tool hosted on the cloud - https://phabricator.wikimedia.org/T419246#11697161 (Samwilson) Perhaps we should look again at allowing Internet Archive source images? Th...
[13:27:24] Tool-wmf-openapi-linter, MW-Interfaces-Team: Add tests to ensure consistency between OAD example and OpenAPI linter - https://phabricator.wikimedia.org/T419576#11697336 (BPirkle) p:Triage→Medium
[13:31:21] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: ToolforgeKubernetesCapacity alert actionability - https://phabricator.wikimedia.org/T419674#11697360 (fgiunchedi) Open→Invalid Thank you @dcaro for the pointer to T404726 ! I went through it again and it was a good read; I'm resolving this o...
[13:45:03] cloud-services-team, Toolforge: Add new alerts for Toolforge cluster high load - https://phabricator.wikimedia.org/T414513#11697453 (fgiunchedi) Following up from {T419674} >>! In T419674#11696855, @dcaro wrote: > Memory is tricky in that if your job hits the limit, or the host has no more free memory (...
[15:44:07] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393#11698352 (Andrew) >>! In T417393#11696203, @fgiunchedi wrote: > Plan is to grab another announced maint window on Tues March 17th to resume the testing....
[15:45:33] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393#11698356 (Andrew) Oh, to check the maintenance state of a host you want to look at the host aggregates. Docs for that here: https://wikitech.wikimedia.or...
[15:51:19] (open) dcaro: toolforge_deploy: add restore-all [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/314
[15:51:39] (update) dcaro: toolforge_deploy: add restore-all [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/314
[15:52:17] FIRING: JobUnavailable: Reduced availability for job pdns_rec in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[15:57:17] RESOLVED: JobUnavailable: Reduced availability for job pdns_rec in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[16:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[16:57:09] FIRING: [2x] ProbeDown: Service virt.cloudgw.codfw1dev.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_codfw1dev_wikimediacloud_org_from_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:09] RESOLVED: [4x] ProbeDown: Service virt.cloudgw.codfw1dev.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_codfw1dev_wikimediacloud_org_from_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:09] FIRING: [8x] ProbeDown: Service virt.cloudgw.codfw1dev.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_codfw1dev_wikimediacloud_org_from_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:17:09] RESOLVED: [8x] ProbeDown: Service virt.cloudgw.codfw1dev.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_codfw1dev_wikimediacloud_org_from_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:19:28] cloud-services-team, Cloud-VPS, Patch-For-Review: cloudgw2004-dev service implementation - https://phabricator.wikimedia.org/T418765#11698847 (Andrew) Open→Resolved 2004-dev is up and working now, thanks to @taavi and a reboot.
[17:20:47] cloud-services-team, decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738 (Andrew) NEW
[17:21:00] cloud-services-team, decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11698866 (Andrew)
[17:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[17:26:20] cloud-services-team, Toolforge: Add new alerts for Toolforge cluster high load - https://phabricator.wikimedia.org/T414513#11698884 (dcaro) Something we can improve there also is that they are metrics with a big cardinality (has the pod/namespace and such), so maybe we want to aggregate them somehow at t...
[17:52:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-test-k8s-etcd-32 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[17:53:43] cloud-services-team, Toolforge, Tools, Security-Team, and 3 others: tools.buckbot leaks its k8s and DB credentials on GitHub - https://phabricator.wikimedia.org/T419311#11698991 (Alachuckthebuck) Open→Resolved This is resolved now that any keys are stale. Thanks for everyone's assista...
[17:53:53] cloud-services-team, Toolforge, Tools, Security-Team, and 3 others: tools.buckbot leaks its k8s and DB credentials on GitHub - https://phabricator.wikimedia.org/T419311#11698993 (sbassett) p:Triage→High
[18:14:08] cloud-services-team, Toolforge: [envars] Provide a $TOOL or $TOOL_NAME default envvar - https://phabricator.wikimedia.org/T419601#11699102 (dcaro) p:Triage→Medium
[18:16:32] Cloud-VPS (Quota-requests): Quota increases for gitlab-runners - https://phabricator.wikimedia.org/T418813#11699111 (dcaro) In progress→Resolved
[18:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[18:26:31] cloud-services-team, Toolforge, Patch-For-Review: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11699166 (dcaro) Open→Resolved a:dcaro This was fix already, let me know if I'm mistaken.
[18:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[19:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[19:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[20:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[20:32:59] cloud-services-team, decommission-hardware, Patch-For-Review: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11699813 (Andrew) decom script says: ` 2026-03-11 20:26:04,769 DRY-RUN andrew 1603380 [INFO] Powered off 2026-03-11 20:26:06,402 DRY-RUN andrew 1603380 [INFO] Disabl...
[20:41:35] cloud-services-team, decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11699854 (Andrew)
[20:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[21:06:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudgw2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[21:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[22:06:14] Tool-containers: rproxy running on Toolforge fails to successfully proxy requests to musicbrainz.org - https://phabricator.wikimedia.org/T419777 (bd808) NEW
[22:12:39] Tool-containers: rproxy running on Toolforge fails to successfully proxy requests to musicbrainz.org - https://phabricator.wikimedia.org/T419777#11700155 (bd808) p:Triage→High While writing this up I realized that there was another `curl` test that I had not attempted that might provide new informati...
[22:16:02] Tool-containers: Create a reusable container to replace nginx ingress anonymizing reverse proxy setups - https://phabricator.wikimedia.org/T414836#11700159 (bd808) I forked the musicbrainz problem out to {T419777}. I am not sure at this point that musicbrainz is the only broken upstream (seems unlikely reall...
[22:17:48] Tool-containers: rproxy running on Toolforge fails to successfully proxy requests to musicbrainz.org - https://phabricator.wikimedia.org/T419777#11700163 (taavi) One more data point: the request fails if run from a Kubernetes worker node but from outside Kubernetes. `lang=shell-session taavi@tools-k8s-worker...
[22:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[22:27:33] Tool-containers: rproxy running on Toolforge fails to successfully proxy requests to musicbrainz.org - https://phabricator.wikimedia.org/T419777#11700189 (bd808) >>! In T419777#11700163, @taavi wrote: > Is our shared NAT egress address being blocked upstream? nat.cloudgw.eqiad1.wikimediacloud.org (185.15.56...
[22:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:23:42] RESOLVED: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:53:42] FIRING: AlertLintProblem: Linting problems found for NeutronAgentAdminDown - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem