[08:59:46] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520004 (elukey) I started `build-production-images` on build2001 in a tmux session and left it running; I found it this morning and it successfully completed....
[09:20:37] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520053 (elukey) Background: the Docker registry hosts run nginx as a TLS terminator. It runs with a 4G tmpfs partition that holds `/var/lib/nginx`. This mea...
[09:24:02] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520075 (elukey) Also added a [[ https://grafana-rw.wikimedia.org/d/StcefURWz/docker-registry?forceLogin&orgId=1&viewPanel=51&from=now-24h&to=now | graph ]]...
[10:24:32] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520272 (jijiki) After chatting with @elukey, our current status is that: * this will happen again, so look out for other builds being uploaded at the same t...
[10:34:59] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520305 (isarantopoulos) I checked the compressed image size for the updated `amd-pytorch25`; it is now 4.1 GB in total. ([[ https://gerrit.wikimedia.org/r/c...
[10:52:10] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520350 (elukey) The image was uploaded (https://docker-registry.wikimedia.org/amd-pytorch25/tags/) but I think we are getting really close to the limit. We...
[11:04:38] serviceops, Citoid, Editing QA, Editing-team, and 3 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10520404 (Mvolz) >>! In T361576#10519300, @Ryasmeen wrote: >>>! In T361576#10515182, @Mvolz wrote: >> This is now available to test on te...
[12:06:54] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10520614 (elukey) >>! In T385531#10520350, @elukey wrote: > The image was uploaded (https://docker-registry.wikimedia.org/amd-pytorch25/tags/) but I think we...
[13:27:14] serviceops, PoolCounter, MediaWiki-Platform-Team (Radar): poolcounter-exporter upgrade - https://phabricator.wikimedia.org/T333947#10520958 (fgiunchedi) Open→Resolved This is done, poolcounter-exporter upgraded and the related lint alert is gone too.
[13:27:17] serviceops, MediaWiki-Platform-Team, PoolCounter, Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#10520962 (fgiunchedi)
[13:53:15] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10521030 (Jgiannelos) A: kartotherian in current bare metal prod B: kartotherian in prod k8s pod Latest difftesting run after fixing hanging connections...
[14:02:33] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10521047 (Jgiannelos) Here are the latency quantiles for each A/B test run. {F58355280}
[14:06:32] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10521064 (Jgiannelos) I think it's pretty safe to continue with the migration. Closing this ticket for now. We can run the tests again in the future if i...
[14:06:42] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10521075 (Jgiannelos) Open→Resolved a:Jgiannelos
[14:35:02] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10521199 (Jgiannelos) {F58355469}
[14:45:00] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10521237 (elukey) Summary: the main issue seemed to be the pytorch base image crossing the 4GB max compressed Docker layer size, it was reverted and cleaned...
[15:08:46] serviceops, MW-on-K8s: Create a logstash dashboard for mediawiki periodic jobs - https://phabricator.wikimedia.org/T385594 (Clement_Goubert) NEW
[15:08:51] serviceops, MW-on-K8s: Create a logstash dashboard for mediawiki periodic jobs - https://phabricator.wikimedia.org/T385594#10521398 (Clement_Goubert) p:Triage→Medium
[15:13:11] serviceops: Align mw-on-k8s alerts with PHP 8.1 migration - https://phabricator.wikimedia.org/T384532#10521405 (Scott_French) That should now cover all of the mw-on-k8s alerts. Thanks @jijiki!
[15:13:32] serviceops: Align mw-on-k8s alerts with PHP 8.1 migration - https://phabricator.wikimedia.org/T384532#10521406 (Scott_French) Open→Resolved
[15:16:10] serviceops, MW-on-K8s: Identify low-criticality maintenance job to move to mwcron - https://phabricator.wikimedia.org/T377963#10521413 (Clement_Goubert) I've started a [[ https://docs.google.com/spreadsheets/d/1AXG1WVOmo9Hp2u3OTmCNbWOT27qM-sJayzo4Uv8KIa4/edit?gid=0#gid=0 | migration tracking spreadsheet...
[15:22:50] serviceops, MW-on-K8s: Allow defining kubernetes cronjobs through puppet - https://phabricator.wikimedia.org/T385596 (Clement_Goubert) NEW
[15:23:12] serviceops, MW-on-K8s: Allow defining kubernetes cronjobs through puppet - https://phabricator.wikimedia.org/T385596#10521443 (Clement_Goubert) Open→In progress p:Triage→High
[15:47:57] serviceops, Infrastructure-Foundations: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531#10521519 (Scott_French) Thank you all! I can confirm the 8.1.34-1-20250203 images are now present in the registry, so this seems to have succeeded on a subse...
[16:03:35] serviceops, Patch-For-Review: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225#10521622 (Scott_French) Open→Resolved Many thanks to folks for investigating T385531 earlier today, the new mercurius version is live as of ~ 14:30 UTC. I'll follow...
[16:14:43] serviceops, Commons, Shellbox, TimedMediaHandler-Transcode, and 2 others: Videos intermittently failing to transcode with error "Exception: Shellbox server returned status code 503" - https://phabricator.wikimedia.org/T385365#10521655 (Scott_French) Thanks for connecting the dots, here, @jijiki,...
[16:21:00] serviceops, Commons, Shellbox, TimedMediaHandler-Transcode, and 2 others: Videos intermittently failing to transcode with error "Exception: Shellbox server returned status code 503" - https://phabricator.wikimedia.org/T385365#10521692 (A_smart_kitten) Thanks, @Scott_French! Would it be safe to sa...
[16:50:56] serviceops, Commons, Shellbox, TimedMediaHandler-Transcode, and 2 others: Videos intermittently failing to transcode with error "Exception: Shellbox server returned status code 503" - https://phabricator.wikimedia.org/T385365#10521796 (Scott_French) Thanks, @A_smart_kitten. If a transcode that st...
[18:13:05] serviceops, Commons, Shellbox, TimedMediaHandler-Transcode, and 2 others: Videos intermittently failing to transcode with error "Exception: Shellbox server returned status code 503" - https://phabricator.wikimedia.org/T385365#10522173 (hnowlan) Unfortunately shellbox doesn't give us a lot of gran...