[08:07:24] (CR) Hashar: [C:+2] Add TestKitchen dependency to WikimediaEvents [integration/config] - https://gerrit.wikimedia.org/r/1241869 (https://phabricator.wikimedia.org/T417068) (owner: Kareid)
[08:09:10] (Merged) jenkins-bot: Add TestKitchen dependency to WikimediaEvents [integration/config] - https://gerrit.wikimedia.org/r/1241869 (https://phabricator.wikimedia.org/T417068) (owner: Kareid)
[08:16:43] Gerrit, collaboration-services: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084 (ABran-WMF) NEW
[08:27:49] Gerrit, collaboration-services, Patch-For-Review, Puppet: Gerrit git replication should not break when Puppet changes its config - https://phabricator.wikimedia.org/T416929#11639182 (hashar) The short fix is to [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238043 | disable configuratio...
[08:41:54] (CR) Hashar: [C:+2] build: Updating dependencies [integration/docroot] - https://gerrit.wikimedia.org/r/1241071 (owner: Libraryupgrader)
[08:46:36] Release-Engineering-Team, Scap: git pull on a scap deployment repository deletes scap sync tags - https://phabricator.wikimedia.org/T418085 (hashar) NEW
[09:03:36] (CR) Hashar: [C:+2] Zuul: [mediawiki/extensions/CollabPads] Use quibble-bluespice [integration/config] - https://gerrit.wikimedia.org/r/1241059 (owner: Umherirrender)
[09:05:17] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/CollabPads] Use quibble-bluespice [integration/config] - https://gerrit.wikimedia.org/r/1241059 (owner: Umherirrender)
[09:06:34] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11639274 (ABran-WMF) p:Triage→Medium
[09:07:08] (CR) Hashar: [C:+2] "Deployed" [integration/config] - https://gerrit.wikimedia.org/r/1241059 (owner: Umherirrender)
[09:26:46] Release-Engineering-Team, translatewiki.net: MediaWiki translation sync broken - https://phabricator.wikimedia.org/T418087 (Nikerabbit) NEW
[09:27:22] Project beta-scap-sync-world build #246533: FAILURE in 2 min 2 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/246533/
[09:37:31] Yippee, build fixed!
[09:37:31] Project beta-scap-sync-world build #246534: FIXED in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/246534/
[09:38:00] Release-Engineering-Team, translatewiki.net: MediaWiki translation sync broken - https://phabricator.wikimedia.org/T418087#11639429 (Joe) Do you happen to know what is the user-agent used by the script? Or point me to the source code for it.
[09:46:15] Release-Engineering-Team, translatewiki.net: MediaWiki translation sync broken - https://phabricator.wikimedia.org/T418087#11639503 (Joe) Ah seen the script (thanks @Nikerabbit); first thing to do is make the user-agent compliant with the wikimedia User-Agent policy, so it should contain an email address...
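For reference, a minimal sketch of a script identifying itself with contact details in its User-Agent, using Python's standard `urllib`. This is not the translatewiki.net script. The tool name, version, info URL and email address are placeholders, and the exact string layout is an assumption rather than a quote from the policy.

```python
import urllib.request


def build_user_agent(tool: str, version: str, info_url: str, contact_email: str) -> str:
    """Return a User-Agent naming the tool and a way to reach its operator."""
    return f"{tool}/{version} ({info_url}; {contact_email})"


def make_request(url: str, user_agent: str) -> urllib.request.Request:
    # Building the Request object does not touch the network;
    # only urllib.request.urlopen(req) would send it.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical values, for illustration only.
ua = build_user_agent("example-sync-bot", "1.0", "https://example.org/bot", "ops@example.org")
req = make_request("https://gerrit.wikimedia.org/r/", ua)
print(req.get_header("User-agent"))
```

`urllib` stores header names capitalised as `User-agent`, which is why the lookup above uses that spelling.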
[09:46:34] Gerrit, Release-Engineering-Team, collaboration-services, Patch-For-Review: Rename gerrit2 unix user to gerrit and assign a fixed uid - https://phabricator.wikimedia.org/T338470#11639506 (ABran-WMF)
[10:07:37] Gerrit, Release-Engineering-Team, collaboration-services: gerrit: gerrit-replica behind CDN - https://phabricator.wikimedia.org/T418108 (ABran-WMF) NEW
[10:07:50] Gerrit, Release-Engineering-Team, collaboration-services: gerrit: gerrit-replica behind CDN - https://phabricator.wikimedia.org/T418108#11639691 (ABran-WMF)
[10:07:53] Gerrit, Release-Engineering-Team, collaboration-services, Traffic, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11639690 (ABran-WMF)
[10:09:14] Gerrit, Release-Engineering-Team, collaboration-services, Traffic, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11639695 (ABran-WMF) Open→Stalled p:Triage→Medium This is currently blocked...
[10:09:34] Gerrit, Release-Engineering-Team, collaboration-services, Traffic, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11639703 (ABran-WMF) a:ABran-WMF
[10:09:51] Gerrit, Release-Engineering-Team, collaboration-services: gerrit: gerrit-replica behind CDN - https://phabricator.wikimedia.org/T418108#11639705 (ABran-WMF) Open→In progress p:Triage→High
[10:20:56] Continuous-Integration-Infrastructure, Jenkins, collaboration-services: Update Jenkins hosts from Java 17 to Java 21 - https://phabricator.wikimedia.org/T418109 (jnuche) NEW
[10:51:17] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11639819 (hashar) From my comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238315/comments/a98b94a7_a77584ef , the Gerrit replication plugin exposes...
[11:19:39] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11639934 (hashar) I have also added the per replicas retries (whatever it means) counts and rate: {F72279291 width=500}
[11:30:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance deployment-puppetserver-1 in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[11:30:38] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T418120 (wmcs-alerts) NEW
[11:32:20] Gerrit-Privilege-Requests, Release-Engineering-Team, Security-Team, SRE, SRE-Access-Requests: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11640007 (Rsilvola)
[11:32:56] Gerrit-Privilege-Requests, Release-Engineering-Team, Security-Team, SRE, SRE-Access-Requests: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11640023 (Rsilvola) Open→Declined Hello @Dzahn, Much of this was already filled in [T4...
[12:04:40] Release-Engineering-Team, translatewiki.net: MediaWiki translation sync broken - https://phabricator.wikimedia.org/T418087#11640145 (Joe) p:Triage→Medium a:Joe
[12:05:03] maintenance-disconnect-full-disks build 783601 integration-agent-docker-1057 (/: 37%, /srv: 97%, /var/lib/docker: 28%): OFFLINE due to disk space
[12:05:11] Project beta-code-update-eqiad build #588867: FAILURE in 14 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/588867/
[12:10:02] maintenance-disconnect-full-disks build 783602 integration-agent-docker-1057 (/: 37%, /srv: 83%, /var/lib/docker: 27%): RECOVERY disk space OK
[12:15:17] Yippee, build fixed!
[12:15:18] Project beta-code-update-eqiad build #588868: FIXED in 2 min 17 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/588868/
[12:22:58] Phabricator, Release-Engineering-Team (Priority Backlog 📥): Decrease number of open Phab tickets with assignee field set for more than two years (aka cookie licking) (Q2/2026 edition) - https://phabricator.wikimedia.org/T418127 (Aklapper) NEW p:Triage→Low
[12:23:01] Phabricator, Release-Engineering-Team (Doing 😎): Decrease number of open Phab tickets with assignee field set for more than two years (aka cookie licking) (Q4/2025 edition) - https://phabricator.wikimedia.org/T397713#11640303 (Aklapper) Stalled→Resolved
[12:25:42] (open) aklapper: Uninstall Countdown application [repos/phabricator/deployment] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/95 (https://phabricator.wikimedia.org/T418033)
[12:26:19] Phabricator, Patch-For-Review: Uninstall Countdown (Phabricator application) - https://phabricator.wikimedia.org/T418033#11640312 (Aklapper)
[13:13:21] Release-Engineering-Team, translatewiki.net: MediaWiki translation sync broken - https://phabricator.wikimedia.org/T418087#11640435 (Joe) Open→Resolved Added an exception for translatewiki so they can keep syncing from gerrit at the previous rate, given that never created an issue for us.
[13:14:47] Diffusion, Phabricator, Release-Engineering-Team (Priority Backlog 📥): Disable "diffusion.allow-http-auth" - https://phabricator.wikimedia.org/T418045#11640440 (Aklapper)
[13:16:12] (open) aklapper: Remove / disable diffusion.allow-http-auth [repos/phabricator/deployment] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/96 (https://phabricator.wikimedia.org/T418045)
[13:16:56] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11640451 (ABran-WMF) Thanks @hashar for the graphs! The issue we had was happening between `12:xx` and `18:xx` and I found nothing reflecting that in these graphs....
[13:40:27] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11640533 (ABran-WMF) I think I found something interesting on the dashboard [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238315/comments/a98b94a7_a77584...
[13:45:02] maintenance-disconnect-full-disks build 783621 integration-agent-docker-1054 (/: 51%, /srv: 98%, /var/lib/docker: 22%): OFFLINE due to disk space
[13:50:03] maintenance-disconnect-full-disks build 783622 integration-agent-docker-1051 (/: 36%, /srv: 97%, /var/lib/docker: 24%): OFFLINE due to disk space
[13:50:03] maintenance-disconnect-full-disks build 783622 integration-agent-docker-1054 (/: 51%, /srv: 66%, /var/lib/docker: 21%): RECOVERY disk space OK
[13:55:03] maintenance-disconnect-full-disks build 783623 integration-agent-docker-1051 (/: 36%, /srv: 56%, /var/lib/docker: 24%): RECOVERY disk space OK
[14:00:09] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11640604 (ABran-WMF)
[14:37:17] https://etherpad-backup.toolforge.org/p/-1-to-100-Team
[14:37:27] " [Chad] - Testing *is* faster, we fixed that ~2 weeks ago. Tests should only take 10-15mins for the full suite now, instead of 2-3 hours."
[14:37:55] Puts things in perspective :)
[14:46:48] Phabricator, Release-Engineering-Team (Doing 😎), Tool-gitlab-account-approval: Mark glaab Phabricator account as bot in database - https://phabricator.wikimedia.org/T407690#11640711 (Aklapper) Open→Resolved a:Aklapper Done. FYI Conduit logs: https://phabricator.wikimedia.org/conduit/...
[14:50:34] Phabricator, Release-Engineering-Team (Doing 😎): Mark Pywikibugs Phabricator account as bot in database - https://phabricator.wikimedia.org/T410571#11640720 (Aklapper) Open→Resolved a:Aklapper Done. FYI Conduit logs: https://phabricator.wikimedia.org/conduit/log/query/3QErnZRvzCYM/#R
[14:52:04] Gerrit, collaboration-services: gerrit: nft_throttling_denylists triggers NodeTextfileStale alert - https://phabricator.wikimedia.org/T418139 (ABran-WMF) NEW
[14:53:17] Gerrit, collaboration-services, Patch-For-Review: gerrit: nft_throttling_denylists triggers NodeTextfileStale alert - https://phabricator.wikimedia.org/T418139#11640750 (ABran-WMF) Open→In progress p:Triage→Low
[14:53:26] Phabricator, Release-Engineering-Team (Doing 😎), ReleaseTaggerBot: Mark ReleaseTaggerBot Phabricator account as bot in database - https://phabricator.wikimedia.org/T329748#11640753 (Aklapper) Open→Resolved a:Aklapper Done. FYI hourly Conduit logs: https://phabricator.wikimedia.org/co...
[15:01:57] Gerrit, Release-Engineering-Team, collaboration-services, Patch-For-Review: gerrit: gerrit-replica behind CDN - https://phabricator.wikimedia.org/T418108#11640773 (ABran-WMF)
[15:03:22] Gerrit, collaboration-services, Traffic, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: ATS causes git fetches from Gerrit to fail with 502 responses - https://phabricator.wikimedia.org/T417536#11640775 (ABran-WMF)
[15:03:23] Gerrit, Release-Engineering-Team, collaboration-services, Patch-For-Review: gerrit: gerrit-replica behind CDN - https://phabricator.wikimedia.org/T418108#11640776 (ABran-WMF)
[15:03:24] Phabricator, Release-Engineering-Team (Priority Backlog 📥): Mark DeploymentCalendarTool Phabricator account as bot in database - https://phabricator.wikimedia.org/T418140 (Aklapper) NEW p:Triage→Low
[15:03:24] Gerrit, Release-Engineering-Team, collaboration-services, Traffic, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11640777 (ABran-WMF)
[15:03:31] Phabricator, Release-Engineering-Team (Priority Backlog 📥): Mark Wmdephabbot Phabricator account as bot in database - https://phabricator.wikimedia.org/T418141 (Aklapper) NEW p:Triage→Low
[15:32:05] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team, Zuul, and 3 others: Make puppet-compiler execution run with higher priority, not like other 'experimental' jobs - https://phabricator.wikimedia.org/T414621#11640983 (LSobanski) p:Triage→Medium
[15:32:30] (CR) Jforrester: "Yes, apparently it's still in active development in that repo as well as pulled into thumbor (?) somewhere." [integration/config] - https://gerrit.wikimedia.org/r/1240998 (owner: Jforrester)
[15:33:25] Continuous-Integration-Infrastructure, Release-Engineering-Team, Infrastructure-Foundations: Upgrade releng/operations-puppet CI image from Bullseye to Bookworm - https://phabricator.wikimedia.org/T417976#11640984 (LSobanski)
[15:33:43] Continuous-Integration-Infrastructure, Release-Engineering-Team, Infrastructure-Foundations: Upgrade releng/operations-puppet CI image from Bullseye to Bookworm - https://phabricator.wikimedia.org/T417976#11640985 (LSobanski) p:Triage→Medium
[15:50:40] Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), Infrastructure-Foundations: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#11641071 (MoritzMuehlenhoff) Open→Declined This is no...
[16:17:41] Gerrit, collaboration-services, Patch-For-Review: gerrit: nft_throttling_denylists triggers NodeTextfileStale alert - https://phabricator.wikimedia.org/T418139#11641285 (ABran-WMF)
[16:24:40] (merge) jforrester: Remove leading newline from the "what" parameters in the calendar [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/235 (owner: matmarex)
[16:25:51] (CR) Jforrester: [C:+2] Zuul: [3d2png] Add basic Node CI at version 20 [integration/config] - https://gerrit.wikimedia.org/r/1240998 (owner: Jforrester)
[16:28:10] (Merged) jenkins-bot: Zuul: [3d2png] Add basic Node CI at version 20 [integration/config] - https://gerrit.wikimedia.org/r/1240998 (owner: Jforrester)
[16:29:02] !log Zuul: [3d2png] Add basic Node CI at version 20
[16:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:51:13] Continuous-Integration-Infrastructure (Zuul upgrade), Gerrit, Release-Engineering-Team, collaboration-services: Fix up Gerrit sshd.idleTimeout - https://phabricator.wikimedia.org/T417996#11641444 (LSobanski) p:Triage→Medium
[16:53:19] Continuous-Integration-Infrastructure (Zuul upgrade), Gerrit, Release-Engineering-Team, collaboration-services: Fix up Gerrit sshd.idleTimeout - https://phabricator.wikimedia.org/T417996#11641455 (LSobanski) Is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240243 related?
[16:58:57] Gerrit, Release-Engineering-Team, collaboration-services: Create an OpenSearch dashboard for Gerrit sshd logs - https://phabricator.wikimedia.org/T417753#11641502 (LSobanski) p:Triage→Medium
[17:10:24] (CR) Hnowlan: "In a less-than-ideal flow, when thumbor images are built, 3d2png's `master` is also installed to use the system nodejs (currently bookworm" [integration/config] - https://gerrit.wikimedia.org/r/1240998 (owner: Jforrester)
[17:29:26] (open) dancy: Remove redundant php_entrypoint.sh [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[17:29:28] (update) dancy: Remove redundant php_entrypoint.sh [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[17:36:32] (update) dancy: Use direct reference to php_entrypoint.sh instead of symlink [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[17:36:35] (update) dancy: Use direct reference to php_entrypoint.sh instead of symlink [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[17:45:57] (close) dancy: Use direct reference to php_entrypoint.sh instead of symlink [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[17:45:58] (update) dancy: Use direct reference to php_entrypoint.sh instead of symlink [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/91
[18:19:05] (merge) aklapper: Hide unneeded "SSH Public Keys" personal settings panel [repos/phabricator/phabricator] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/phabricator/-/merge_requests/113 (https://phabricator.wikimedia.org/T418044)
[18:19:40] Phabricator (phabricator-next), Release-Engineering-Team (Doing 😎): Hide unneeded "SSH Public Keys" personal settings panel - https://phabricator.wikimedia.org/T418044#11641855 (Aklapper)
[18:20:03] maintenance-disconnect-full-disks build 783676 integration-agent-docker-1041 (/: 37%, /srv: 97%, /var/lib/docker: 23%): OFFLINE due to disk space
[18:20:03] maintenance-disconnect-full-disks build 783676 integration-agent-docker-1051 (/: 36%, /srv: 97%, /var/lib/docker: 26%): OFFLINE due to disk space
[18:25:03] maintenance-disconnect-full-disks build 783677 integration-agent-docker-1041 (/: 37%, /srv: 84%, /var/lib/docker: 22%): RECOVERY disk space OK
[18:25:03] maintenance-disconnect-full-disks build 783677 integration-agent-docker-1051 (/: 36%, /srv: 69%, /var/lib/docker: 26%): RECOVERY disk space OK
[18:45:15] we don't have database backups for the beta cluster, do we?
[18:45:43] (i didn't break anything too badly)
[18:47:15] MatmaRex: we don't as far as I know. There was once a deliberately lagged replica but I think that may be gone these days.
[18:49:00] alright. fyi, i ran a maintenance script: https://phabricator.wikimedia.org/P88982 and i think the output is lying to me. i'd like to find out what exactly it did, but i guess we'll never know
[19:07:20] Gerrit-Privilege-Requests, Release-Engineering-Team, Security-Team, SRE, SRE-Access-Requests: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11642002 (Rsilvola) Declined→Open (reopening)
[19:31:36] Release-Engineering-Team (Doing), MediaWiki-extensions-InterwikiSorting, MW-1.46-notes (1.46.0-wmf.16; 2026-02-17), Patch-For-Review, and 2 others: Undeploy the InterwikiSorting extension from Wikipedia production - https://phabricator.wikimedia.org/T253764#11642138 (Iniquity) One of the main...
[20:26:45] !log Deleted "replication-upstream" Grafana dashboard in favor of a copy/new "replication" one. https://grafana.wikimedia.org/d/RFLS1GsWk/replication-upstream , replaced it by https://grafana.wikimedia.org/d/d4a4da73-c27f-4ce6-a9e5-ab84dd7a4ebb/replication
[20:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:27:25] Release-Engineering-Team, dev-images, Catalyst (Radar): dev-images: bookworm-php-sury children fail to build on sury key - https://phabricator.wikimedia.org/T417711#11642333 (dancy) a:dancy I'm running into this problem today so I'll take this ticket.
[20:27:30] Release-Engineering-Team, dev-images, Catalyst (Radar): dev-images: bookworm-php-sury children fail to build on sury key - https://phabricator.wikimedia.org/T417711#11642335 (dancy) Open→In progress p:Triage→Medium
[20:50:36] (open) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[20:50:37] (update) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[20:52:26] (update) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[21:16:39] (update) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[21:16:39] (open) dancy: php_entrypoint.sh: Run php-fpm under tini [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/93
[21:16:39] (update) dancy: php_entrypoint.sh: Run php-fpm under tini [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/93
[21:16:46] (update) dancy: php_entrypoint.sh: Run php-fpm under tini [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/93
[21:16:46] (update) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[21:19:00] (update) dancy: php_entrypoint.sh: Run php-fpm under tini [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/93
[21:24:00] (CR) Dmaza: [C:+1] "What would be the ideal flow?" [integration/config] - https://gerrit.wikimedia.org/r/1240998 (owner: Jforrester)
[21:42:25] (approved) jhuneidi: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711) (owner: dancy)
[21:42:54] (update) dancy: php_entrypoint.sh: Run php-fpm under tini [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/93
[21:42:54] (merge) dancy: bookworm-php-sury: Update sury-php GPG key [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92 (https://phabricator.wikimedia.org/T417711)
[21:54:25] Now I've gotten my mitigation patch for this deployed and have time to look at secondary issues... anyone here in a position to have an idea what's going on with https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict 404ing?
[22:12:03] !log Unblock 191.80.192.0/18 (T418132)
[22:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:18:29] (update) dduvall: digitalocean: Separate management of cluster and in-cluster resources [repos/releng/gitlab-cloud-runner] - https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/552 (https://phabricator.wikimedia.org/T416260)
[22:18:32] (update) dduvall: digitalocean: Separate management of cluster and in-cluster resources [repos/releng/gitlab-cloud-runner] - https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/552 (https://phabricator.wikimedia.org/T416260)
[22:23:26] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T418120#11642781 (bd808) `lang=shell-session,counterexample bd808@deployment-puppetserver-1.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv Info:...
[22:35:44] sorry, it's me again… is the beta cluster able to send emails? specifically, emails from the Special:ConfirmEmail feature? i am trying to change my email address there and haven't gotten a confirmation email
[22:41:48] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11642836 (hashar) Awesome! For reference the metric is: ` events_ref_replication_scheduled_total - events_ref_replicated_total ` I could not explain those two down...
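The metric quoted in the 22:41:48 message is a difference of two counters. A minimal sketch of that arithmetic on plain lists, assuming both series are sampled at the same instants. The sample values are made up and are not taken from the Gerrit dashboard.

```python
def replication_backlog(scheduled_total, replicated_total):
    """Per-sample difference: refs scheduled for replication minus refs replicated."""
    if len(scheduled_total) != len(replicated_total):
        raise ValueError("both series need the same number of samples")
    return [s - r for s, r in zip(scheduled_total, replicated_total)]


# Hypothetical samples standing in for events_ref_replication_scheduled_total
# and events_ref_replicated_total.
scheduled = [100, 120, 150, 150]
replicated = [100, 110, 130, 150]
backlog = replication_backlog(scheduled, replicated)
print(backlog)  # [0, 10, 20, 0]
```

A value that stays above zero means refs were scheduled but not yet replicated at that sample; the difference returns to zero once the replicated counter catches up.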
[22:42:31] MatmaRex: I believe so
[22:42:49] (open) jhuneidi: Draft: catalyst-*: Refresh for gpg key update [repos/releng/dev-images] - https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/94 (https://phabricator.wikimedia.org/T417711)
[22:43:04] !log Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92
[22:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:44:19] MatmaRex: in theory yes, but T212327 might still be a thing.
[22:44:20] T212327: Beta Cluster mailer not sending emails to @wikimedia.org addresses - https://phabricator.wikimedia.org/T212327
[22:44:48] yeah, i just found that… T285527 and T291679 too
[22:44:49] T285527: Unable to confirm email address on beta cluster - https://phabricator.wikimedia.org/T285527
[22:44:49] T291679: "Sender address rejected: Domain not found" for emails sent from the beta cluster - https://phabricator.wikimedia.org/T291679
[22:45:01] fwiw i used a @fastmail.com address
[22:51:25] MatmaRex: I don't see any stuck mails outbound to fastmail on deployment-mx04. The message might not have gotten that far though. Did you look for mediawiki logs?
[22:52:50] yup, haven't found anything relevant, except "Changing email address for Yatu from A to B"
[22:53:14] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11642858 (hashar) Another thing I looked at tonight is the delay/latency graph. They show the values for different quantiles which remains when there is no replicat...
[22:53:30] this guy seems smart and proposed a solution (back in 2021): https://phabricator.wikimedia.org/T291679#7474212
[22:55:01] i guess i could do what he says? it probably won't break anything worse than it is. i have to say i don't know what i'm doing though
[22:58:17] There is a floating IP setup in the Deployment-prep Cloud VPS project with the description "For MX server".
[22:58:33] * bd808 looks closer
[22:59:37] huh. not mapped to any instance and the Horizon UI doesn't give me any logging about the history of that object
[23:00:00] Gerrit, collaboration-services, Patch-For-Review: gerrit: replication monitoring improvement - https://phabricator.wikimedia.org/T418084#11642872 (hashar) Also over the week-end I felt like we should have a view of the replication plugin WorkQueue, it does not emit metrics which I think was forgotten...
[23:00:38] it does reverse to "mail.beta.wmflabs.org", "wikimedia.beta.wmflabs.org", and "wikimedia.beta.wmcloud.org"
[23:01:13] Beta is a hot mess and I don't know what to do about that
[23:06:38] bd808: Beta got used to validate/pretest an extension that deals with emails and/or for a mx migration maybe?
[23:07:02] !log Updated development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/92
[23:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:07:22] i'll try doing the thing chlod suggested and see if it does anything
[23:07:25] Puppet has a bunch of history for "deployment-mx"
[23:07:55] and I think anything email related can most probably be withdrawn in favor of using whatever email relay is available inside of WMCS
[23:08:06] hashar: I think mostly the config rotted over time and nobody at all tries to fix it because, well, it's beta and beta is broken
[23:08:35] well sometimes it is helpful :]
[23:08:55] "just use another MX" is the sort of thing everyone says and nobody tries
[23:08:57] I once found a bug in our Varnish VCL because beta upgraded automatically
[23:09:30] * bd808 is in the middle of writing up an existing Puppet bug for Beta
[23:09:32] I think the deployment-mx was made because some wanted to reproduce every single bit of infra we had
[23:09:40] or maybe there was a use case for testing something involving the mx
[23:09:52] I know it was made because Cloud VPS did not have a shared MX
[23:10:02] AH!
[23:10:11] that sounds like a reasonable reason to set one up
[23:10:25] shared MX servers are hard
[23:10:37] * hashar nods
[23:10:47] I am SO happy to not have to manage an email server anymore
[23:11:08] or a bind dns server :]
[23:11:38] I have bind and sendmail books from the 90s on my shelf here...
[23:12:34] * hashar smiles
[23:12:41] we are getting old
[23:13:16] but I managed to mess up with some Grafana dashboard / PromQL and digging into metrics this evening
[23:13:23] so I am quite happy :] I am not THAT old after all!
[23:13:30] https://grafana-rw.wikimedia.org/d/d4a4da73-c27f-4ce6-a9e5-ab84dd7a4ebb/replication !
[23:14:31] well, i did this: https://horizon.wikimedia.org/ngdetails/OS::Designate::RecordSet/4482e09c-3d25-447f-b8e2-5aa2a105b60e/99124be1-c471-4c74-b2f9-b4288ae3447c it doesn't seem to have helped
[23:16:00] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T418120#11642920 (bd808) `lang=shell-session bd808@mbp03:~/projects/wmf/operations/puppet$ git log --after=2026-02-21 $(git grep -l analytics-sre) 94...
[23:26:11] MatmaRex: I think we should delete that rather than leaving it around to possibly confuse folks in the future. Pointing it at an outbound MX would I think only help if the outside world was just desperate for any type=MX result too.
[23:26:57] bd808: i was just about to comment on the task with a link, to see if anyone there is interested in helping debug it
[23:26:57] * bd808 is working on getting puppet to work on the puppetserver at the moment
[23:28:45] MatmaRex: ok. maybe add a note to the record that points back to the phab task then?
[23:29:01] and I see you did that. never mind
[23:29:01] i did
[23:29:04] unless i messed it up
[23:29:07] Beta-Cluster-Infrastructure: "Sender address rejected: Domain not found" for emails sent from the beta cluster - https://phabricator.wikimedia.org/T291679#11642952 (matmarex) Hi, I found this task, as I am also unable to confirm my email address on the beta cluster. >>! In T291679#7474212, @Chlod wrote: >...
[23:29:07] ah :)
[23:29:16] i wasn't sure if putting a URL in that field would break anything
[23:29:39] but i did that now
[23:29:45] the full url should be fine. that's just a database field and not really DNS anything
[23:31:49] I see there is an SPF record for wikimedia.beta.wmcloud.org that points to the unused floating IP. That's probably not helping things either.
[23:40:00] Beta-Cluster-Infrastructure: "Sender address rejected: Domain not found" for emails sent from the beta cluster - https://phabricator.wikimedia.org/T291679#11642973 (LucasWerkmeister) Well, I just tried sending an email to myself via Special:EmailUser and it worked fine. So from my side this seems to have wor...
[23:55:37] Beta-Cluster-Infrastructure: "Sender address rejected: Domain not found" for emails sent from the beta cluster - https://phabricator.wikimedia.org/T291679#11642995 (matmarex) It depends on the recipient address. I can receive the emails at an @gmail.com address (and I could receive them earlier today as well...