[01:24:48] serviceops, Patch-For-Review: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10594640 (Reedy) p:Triage→Low Did they get uploaded/included? :)
[08:28:34] elukey: I'm ready whenever you are :D
[08:39:26] vgutierrez: o/ I'd need ~30 mins and then I'll be ready, is it ok?
[08:39:41] elukey: 100%
[08:52:28] I also have the jobrunner CRs ready for its migration to IPIP encapsulation here https://gerrit.wikimedia.org/r/q/topic:%22T387295%22, who should take care of those? :)
[09:00:18] actually we can do jobrunner+videoscalers at the same time, given you're currently using the same realservers for both services
[09:01:45] elukey: ok to proceed with docker-registry@eqiad?
[09:04:52] vgutierrez: +1
[09:05:38] merging the change
[09:06:41] serviceops, Datacenter-Switchover, Patch-For-Review: Investigate burst of DBReadOnlyError during switchover test - https://phabricator.wikimedia.org/T387509#10595073 (Volans) I'll to clarify some things that might have generated some confusion. Under normal circumstances for all core DB sections but...
[09:08:20] and proceeding with the cookbook.. first step is running puppet on realservers and LVS, then it will automatically check that realservers can handle inbound traffic IPIP encapsulated, and if that's the case we can restart pybal
[09:10:28] elukey: docker-registry[1004-1005] are happy with IPIP encapsulated traffic, restarting pybal now
[09:11:01] super
[09:11:32] so basically you deploy all the configs on the nodes including the MSS clamper, then restart pybal to pick up maglev + IPIP
[09:11:56] elukey: we aren't using MSS clamper on nodes where iptables is available
[09:12:44] https://www.irccloud.com/pastebin/FhDoEe8N/
[09:13:00] vgutierrez: ahh ok TIL, I thought we used it everywhere. Can I ask you a quick link or TL;DR on why?
[09:13:11] elukey: eBPF is scary :)
[09:14:00] no ok that I totally understand, I mean (to fix my ignorance) what do you use to set the proper MSS then?
[09:14:02] so if iptables is already in place we considered adding an additional ferm rule easier to understand/debug/operate than tcp-mss-clamper
[09:14:08] see my paste :)
[09:14:20] we use ferm
[09:14:28] super I didn't even know it was possible!
[09:14:50] I'm done with eqiad, feel free to torture it
[09:16:48] okok
[09:23:12] the catalog works fine from eqiad (quick HTTP call)
[09:23:24] now I am downloading a big image (pytorch) from build2001
[09:23:53] cool
[09:31:40] vgutierrez: everything looks good from my pov
[09:31:48] lovely
[09:32:36] time to hit codfw then ;P
[09:32:38] the only bit that I don't get for the codfw patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123414/3/hieradata/role/common/docker_registry_ha/registry.yaml
[09:32:47] since there are settings stated two times
[09:33:00] yikes.. let me fix that
[09:34:02] done
[09:35:50] +1ed!
[09:43:56] registry[2004-2005] are happy with IPIP traffic, restarting pybal
[09:44:21] super
[09:45:55] all good, thx elukey
[09:47:57] vgutierrez: <3
[09:48:03] serviceops, Patch-For-Review: Migrate docker-registry LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387294#10595203 (Vgutierrez) Open→Resolved a:Vgutierrez
[09:48:07] I am going to warn the on-callers just in case
[10:04:07] hmm what's the current status of kartotherian? everything is being served by k8s and the regular instances are idling there?
[10:06:58] elukey: ^^ pinging you cause apparently you were the last playing with profile::lvs::realservers:pools for kartotherian
[10:08:14] elukey: also.. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090426/2/hieradata/role/common/maps/replica.yaml this is "wrong", each service should be a systemd service unit name, not a service catalog name
[10:09:48] serviceops: Migrate kartotherian LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387297#10595377 (Vgutierrez) Open→Stalled current puppetization is broken and service apparently has been moved to k8s
[10:10:09] serviceops: Migrate kartotherian-ssl LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387298#10595381 (Vgutierrez) Open→Stalled current puppetization is broken and service apparently has been moved to k8s
[10:18:46] vgutierrez: atm we are migrating to k8s, but we are waiting for some days since it requires a bit of capacity and there are some tests lined up for these days (php8.1, eqiad depool, etc..)
[10:19:19] vgutierrez: okok I'll try to see if I can fix it
[10:19:33] elukey: a simple revert should fix it
[10:19:37] if you could hold off for kartotherian it would be great, we can do it for last
[10:19:43] (of that specific change, not the whole CR)
[10:19:49] okok
[10:19:53] elukey: sure, no rush
[10:20:49] jayme, hnowlan: could I get some help from any of you for https://gerrit.wikimedia.org/r/q/topic:%22T387295%22 <3, it's the migration of jobrunner|videoscaler (non-k8s) to IPIP
[10:21:43] vgutierrez: also, there is a parallel note worth mentioning - for a lot of k8s services it would make sense to move them to Istio Ingress, leveraging a single LVS IP (pointing to istio k8s ingress, that then routes the traffic) rather than having an LVS IP for each service. Migrating services to Ingress is not difficult, but it could require time etc..
[10:22:26] the Ingress LVS is k8s-ingress-wikikube
[10:22:40] yeah.. that's currently out of scope for my IPIP migration
[10:22:49] I'm only targeting non-k8s low-traffic services
[10:22:59] okok super, didn't know
[10:23:10] <_joe_> vgutierrez: uhh I don't think there should be anything left under jobrunner/videoscaler
[10:23:10] k8s is blocked till it's able to handle IPIP
[10:23:30] _joe_: two LVS services and several realservers assigned
[10:24:00] <_joe_> vgutierrez: yeah I assume they're left around to act as scap proxies at this point, but uhm
[10:24:04] <_joe_> hnowlan: ^
[10:24:36] <_joe_> I would assume we can repurpose those servers, but most importantly dismiss the LVS services at this point
[10:24:58] yeah if you get rid of the LVS services that works for me as well :D
[10:25:10] I'll bill the CR time in beers though
[10:25:14] ;P
[10:25:23] <_joe_> lol
[10:25:46] or crostata.. probably beer would be cheaper
[10:32:09] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10595444 (Lucas_Werkmeister_WMDE) >>! In T377038#10589041, @Scott_French wrote: > Also, that's an interesting observation about execution failure (due to a malformed PHP source file)...
[10:42:53] last service I've identified from serviceops to be migrated is restbase, CRs are available here: https://gerrit.wikimedia.org/r/q/topic:%22T387299%22 please let me know who could review them and I'll take care of the rest, thanks <3
[10:45:34] vgutierrez: those servers will be reclaimed this week if that changes your need for those CRs
[10:45:39] the jobrunners that is
[10:45:44] (and videoscalers)
[10:46:08] hnowlan: so I'm guessing you'll be removing the LVS services?
[10:46:30] yeah, but not top of the list right now - if it'd be easier I'm happy for those CRs to proceed
[10:46:43] does the migration cookbook get run before or after merging those?
[10:47:14] I can take care of that for you, reviewing the CRs is more than enough
[10:47:28] the cookbook itself will ask you to merge the change before proceeding
[10:49:19] thanks for that
[10:49:37] thx for the reviews :D
[11:01:02] * vgutierrez proceeding with jobrunner|videoscaler@codfw
[11:02:44] ack
[11:05:13] realservers are accepting IPIP inbound traffic, restarting pybal
[11:10:13] hnowlan: lvs2013 got restarted already, you can validate codfw if needed :)
[11:10:50] (cookbook is currently blocked cause dse k8s is triggering a backend healths alert)
[11:11:52] FIRING: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100
[11:11:55] that's definitely me
[11:11:57] * vgutierrez looking
[11:15:06] Mar 03 11:14:03 mw2278 prometheus-ferm-mss[19147]: def call_iptables(version=4) -> list[str]:
[11:15:06] Mar 03 11:14:03 mw2278 prometheus-ferm-mss[19147]: TypeError: 'type' object is not subscriptable
[11:15:15] yikes.. buster :)
[11:15:19] <_joe_> vgutierrez: busterrr
[11:15:21] <_joe_> yep
[11:15:26] we got busted lol
[11:15:34] <_joe_> I would advise removing the lvs endpoints :)
[11:16:21] lvs errors for ingress on kubestage in codfw are expected currently, will fix early this week
[11:16:46] vgutierrez: looking
[11:17:13] service should be OK, MSS monitoring not so much
[11:18:38] hmm typing has been introduced in python 3.5 and mw2278 has 3.7.3
[11:22:47] volans: any idea why python 3.7.3 isn't happy with `def call_iptables(version=4) -> list[str]:`
[11:22:56] <_joe_> list
[11:22:59] volans: particularly with `list[str]`
[11:22:59] <_joe_> instead of
[11:23:07] <_joe_> from typing import List
[11:23:11] <_joe_> and then using that
[11:23:13] vgutierrez: looks okay from the service end
[11:23:18] hnowlan: thx
[11:23:45] <_joe_> vgutierrez: you can't use list, dict, etc in type definitions before python 3.9
[11:23:46] <_joe_> IIRC
[11:23:57] <_joe_> so you need to import the type classes from "typing"
[11:23:59] vgutierrez: list[] vs List[]
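For reference, the incompatibility described above can be sketched in a few lines. This is a minimal illustration, not the actual prometheus-ferm-mss script: only the `call_iptables(version=4)` signature comes from the traceback, while the function body and the `builtin_generics_supported` helper are hypothetical. Subscripting the built-in `list` only works from Python 3.9 onwards, whereas `typing.List` also works on the 3.7 interpreter that buster ships.

```python
from typing import List


def call_iptables(version: int = 4) -> List[str]:
    """Hypothetical stand-in: build the command line instead of running it."""
    binary = "iptables" if version == 4 else "ip6tables"
    # `-S` prints the rules of the selected table; nothing is executed here.
    return [binary, "-S"]


def builtin_generics_supported() -> bool:
    """Report whether `list[str]` can be evaluated on this interpreter."""
    try:
        list[str]  # raises TypeError on Python < 3.9
    except TypeError:
        return False
    return True
```

Another option on 3.7 and 3.8 would be `from __future__ import annotations`, which defers annotation evaluation so that `list[str]` in a signature is never executed at definition time; whether that suits the real script is an assumption.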
[11:24:01] <_joe_> List, Dict, etc
[11:24:03] in py3.7
[11:24:06] sorry in a meeting
[11:24:10] thx <3
[11:24:18] I'll address that on the .py script then
[11:25:57] you're in the lucky group like ourselves that still need to support buster?
[11:26:00] :-P
[11:28:07] that would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124068
[11:35:16] hnowlan: MSS monitoring is happy now in mw2278, proceeding with eqiad
[11:35:59] great
[11:41:21] realservers are happy with IPIP traffic in eqiad, restarting pybal
[11:43:09] hnowlan: done, all good from my side
[11:44:12] vgutierrez: thank you!
[11:44:34] serviceops, Patch-For-Review: Migrate jobrunner LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387295#10595599 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:44:39] serviceops, Patch-For-Review: Migrate videoscaler LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387296#10595602 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:45:11] hnowlan: shall I proceed with restbase now or after lunch?
[11:50:41] vgutierrez: whenever suits, now is fine for me
[11:51:02] cool, hitting codfw first
[11:51:24] lvs errors for ingress on kubestage in codfw should be gone now fwiw
[11:54:34] jayme: nice, thx <3
[11:59:36] hnowlan: realservers look good, restarting pybal
[11:59:51] vgutierrez: ack, thanks
[12:01:04] done :)
[12:01:10] please check that everything looks good on your side
[12:07:53] looks okay!
[12:10:04] hnowlan: cool, proceeding with eqiad
[12:16:18] realservers are happy with IPIP traffic, restarting pybal
[12:17:44] all good from my side, thx again hnowlan
[12:18:03] serviceops, Patch-For-Review: Migrate restbase-backend LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387299#10595680 (Vgutierrez) Open→Resolved a:Vgutierrez
[12:18:11] serviceops, Patch-For-Review: Migrate restbase-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387300#10595683 (Vgutierrez) Open→Resolved a:Vgutierrez
[12:20:53] vgutierrez: great, thank you very much!
[13:32:58] serviceops, MW-on-K8s, SRE Observability: Periodic job alerting - https://phabricator.wikimedia.org/T385709#10595920 (fgiunchedi) >>! In T385709#10578781, @Clement_Goubert wrote: > Hmm so obviously it's not as simple as I thought it would be. If I understand our [[ https://github.com/wikimedia/operat...
[13:41:31] serviceops, SRE Observability: chartmuseum prometheus metrics cardinality spam - https://phabricator.wikimedia.org/T386808#10595954 (fgiunchedi) >>! In T386808#10591100, @kamila wrote: > The fix hasn't been merged. I pinged them, I'll see if I can get things moving. > > As for where these requests come...
[14:09:02] hmm, parsoid p75 jumped 100% at 11:25, staying high
[14:09:23] ew
[14:17:06] not entirely unusual I guess
[14:53:31] <_joe_> hnowlan: it would be interesting to see if it matches some traffic pattern/UA in the parsoid access logs
[15:19:47] _joe_: looks like a spike in RecordLintJobs in eqiad from 0600 onwards (which is when p75 started to increase, up to 4x of baseline rather than 2x)
[15:19:54] which I hope is benign :)
[15:21:04] <_joe_> and they call the parsoid cluster? TIL
[15:23:18] actually no, recordlintjobs are triggered by parsoid parses
[15:23:34] but parsoid access logs show increased traffic from the jobqueue
[15:57:38] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10596699 (Scott_French) Ah, thanks Lucas! Yes, that makes sense then, and indeed I may have seen something similar before, when shellbox is somehow induced to emit content "early" (de...
[16:01:35] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10596730 (Scott_French) Wow, that was quick. Got one! Logstash: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2025.03.03?id=lwG5XJUBuXzFNByTNe...
[16:11:27] serviceops, MW-on-K8s, SRE Observability: Periodic job alerting - https://phabricator.wikimedia.org/T385709#10596762 (Clement_Goubert) >>! In T385709#10595920, @fgiunchedi wrote: >>>! In T385709#10578781, @Clement_Goubert wrote: >> Hmm so obviously it's not as simple as I thought it would be. If I un...
[16:25:31] serviceops, Datacenter-Switchover: Spicerack support for mw-cron in periodic_jobs functions - https://phabricator.wikimedia.org/T387753 (hnowlan) NEW
[16:30:39] <_joe_> hnowlan: so yeah I think it's some scraper
[17:18:02] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Migrate release template inheritance in helmfiles from YAML anchors to the inherit field - https://phabricator.wikimedia.org/T387760 (JMeybohm) NEW
[19:33:27] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10597983 (Scott_French) As of ~ 18:45 UTC, traffic on mw-api-ext / mw-web has stabilized (as with previous increments, this takes ~ 15m) at pretty much exactly the expected levels [0]...
[19:34:05] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10597984 (Scott_French)
[20:18:30] serviceops, MW-on-K8s, SRE, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598113 (dancy) >>! In T288629#10582102, @JMeybohm wrote: > I stumbled upon this again recently and I think the current con...
[20:20:03] serviceops, MW-on-K8s, SRE, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598121 (dancy) Open→Resolved a:dancy ` After the build process creates the restricted mediawiki-multiversi...
[20:50:08] serviceops, MediaWiki-extensions-OAuth: Allow a user to disable an OAuth client - https://phabricator.wikimedia.org/T254190#10598194 (Krinkle)
[20:50:17] serviceops, MediaWiki-extensions-OAuth: Allow a user to disable an OAuth client - https://phabricator.wikimedia.org/T254190#10598197 (Krinkle)
[20:50:27] serviceops, MediaWiki-extensions-OAuth: Allow a user to disable an OAuth 2.0 client - https://phabricator.wikimedia.org/T254190#10598201 (Krinkle)
[20:50:49] serviceops, MediaWiki-extensions-OAuth: Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598202 (Krinkle)
[20:52:32] serviceops, MediaWiki-extensions-OAuth: Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598219 (Krinkle) Context from the duplicate T234670 task: While log-less deletion of applications may be undesirable, this task proposes that the creator/owner of...
[20:53:34] serviceops, MediaWiki-extensions-OAuth: Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598221 (Krinkle)
[20:55:32] serviceops, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team: Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598225 (Krinkle)
[20:56:14] serviceops, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Roadmap): Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598226 (Krinkle)
[20:58:05] claime: For the cron jobs looks for a team, all the mediawiki_job_update* and mediawiki_job_cron-refreshlinks* are I suppose officially owned by MW Platform, but in practice I think Amir1 is / SRE Data Persistence are the actual experts.
[21:11:11] serviceops, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Roadmap): Allow developers to disable their own OAuth 2.0 clients - https://phabricator.wikimedia.org/T254190#10598281 (Tgr) TBH I am not sure about the value of exposing everything via REST API, rather than just making the API portal a...
[22:47:17] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10598675 (Scott_French) Alright, progress. After a bit of debugging in staging, there are two things going on here. One is that we're clearly running into `post_max_size` in some ca...
[22:48:10] serviceops, MW-on-K8s, Patch-For-Review: Implement periodic maintenance scripts for mw-on-k8s - https://phabricator.wikimedia.org/T341555#10598677 (Krinkle)
[22:56:57] serviceops, MW-on-K8s: Document the new periodic cronjobs infrastructure on Wikitech - https://phabricator.wikimedia.org/T385783#10598739 (Krinkle)
[23:40:15] serviceops, Datacenter-Switchover, Patch-For-Review: Investigate burst of DBReadOnlyError during switchover test - https://phabricator.wikimedia.org/T387509#10598843 (Scott_French) Just to echo / build on what @Volans is saying for extra clarity: The issue in this task is entirely limited to setting...