[06:58:07] claime, jayme - o/ to recap the situation of maps: Moritz and I are testing the new codfw stack on Bookworm, but we are still in a testing phase; we keep running into problems, and during the last repool people noticed even small things not working in maps. At the moment the old stack is lagging behind on updates because we stopped them; I restarted the updater daemon yesterday, but it will take some days to catch up (if we pool the stack without a full sync, some maps may be rendered without updates etc.. and people notice :D).
[06:58:43] ideally, if we waited until next week it would be better for me and moritz: we may be able to just repool the new stack and that's it (or to have the old one fully synced and ready to be used if needed)
[06:59:15] I am terribly sorry to ask this, I know that you already sent the announcement, I didn't realize that we were going to upgrade eqiad this week :(
[07:00:20] the alternative is to re-add the old stack without a full sync, and inform the community about this gap for the small outage window of the k8s upgrade
[07:23:29] elukey: no worries. We could have double-checked for services still pooled in eqiad before announcing the update... either way is fine for me, but let's wait for c.lem before making a call
[07:24:04] AIUI maps has a 95% uptime SLO, so we could probably even take it offline for the duration of the upgrade
[07:24:48] saying that maps has an SLO is a stretch :D
[07:25:06] there may be something on paper, but no error budget is formalized anywhere
[09:22:31] elukey: if we pooled codfw maps (old stack) in codfw... what would happen (in terms of user impact)?
[09:22:47] potentially old map data served in case it's not cached?
[09:22:50] serviceops, MW-Interfaces-Team (MWI-Sprint-19 (2025-09-23 to 2025-10-07)), OKR-Work, Patch-For-Review: Execute test plan for rest gateway rerouting for rest.php requests and report findings - https://phabricator.wikimedia.org/T405368#11228253 (Clement_Goubert) >>! In T405368#11226157, @aaron wrot...
[09:23:49] I'm asking because the actual time window for the upgrade should be rather short if we prioritize re-deploying kartotherian
[09:26:16] serviceops, Proton, RESTBase-API, Content-Transform-Team (Work In Progress), and 2 others: bad request when attempting to download pdfs - https://phabricator.wikimedia.org/T405957#11228267 (Jgiannelos)
[09:26:22] serviceops, Proton, RESTBase-API, Content-Transform-Team (Work In Progress), and 2 others: bad request when attempting to download pdfs - https://phabricator.wikimedia.org/T405957#11228268 (Jgiannelos) a:Jgiannelos
[09:41:07] jayme: sorry, got distracted :(
[09:41:45] so if we repool right now, the community will notice for sure; it happened last time as well (we got reports of maps not rendered correctly etc..)
[09:42:05] the lag is pretty big, I think I'd need 2-3 days to get everything in sync
[09:42:24] the new stack is in sync but we have to fix some issues, and rushing to prod may lead to more trouble
[09:43:54] having said that, we can try to announce the outage window; I know some people I can contact etc.. (and we could also use wikitech-l etc..)
[09:44:05] I need to run now, but we can chat later ok?
[09:52:39] serviceops, Proton, RESTBase-API, Content-Transform-Team (Work In Progress), and 2 others: bad request when attempting to download pdfs - https://phabricator.wikimedia.org/T405957#11228493 (Clement_Goubert) Open→Resolved ` cgoubert@deploy2002:~$ for svc in staging.svc.eqiad rest-gatew...
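The 95% SLO mentioned at 07:24:04 can be turned into a rough downtime allowance. As noted right after it, no error budget is actually formalized for maps, so the 30-day window below is an assumption and the helper names are hypothetical. Under that assumption, 95% availability allows about 36 hours of downtime per window (0.05 × 720 h), which would comfortably cover an upgrade outage measured in hours.

```python
# Rough error-budget arithmetic for a 95% availability SLO.
# Assumption: a 30-day rolling window. No error budget is actually
# formalized for maps; these helpers are hypothetical, for illustration.

def error_budget_hours(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in hours) an availability SLO allows over a window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24

def fits_in_budget(outage_hours: float, slo: float, window_days: int = 30) -> bool:
    """True if a single outage of outage_hours does not exceed the whole budget."""
    return outage_hours <= error_budget_hours(slo, window_days)

budget = error_budget_hours(0.95)
print(f"95% over 30 days allows ~{budget:.1f} h of downtime")
# prints: 95% over 30 days allows ~36.0 h of downtime
```

Passing `window_days=90` gives the equivalent per-quarter allowance (108 hours).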
[10:25:19] serviceops, Proton, RESTBase-API, Content-Transform-Team (Work In Progress), and 2 others: bad request when attempting to download pdfs - https://phabricator.wikimedia.org/T405957#11228629 (Aklapper) p:Triage→High
[12:33:00] jayme, claime - I am working on two fronts: 1) letting the old maps stack catch up (as much as possible) with the latest osm data 2) getting the new stack working (should be close but there are some issues, I am testing now)
[12:33:20] elukey: can I help?
[12:36:25] claime: nono thanks, I'll keep you updated
[12:37:17] elukey: I can send a follow-up email to the announcement to at least add that maps may experience degradation
[12:38:26] elukey: just to have it said: if the new stack produces some errors but works in general, it might still be fine to go with it, since the downtime will be rather short
[12:44:43] jayme: don't jinx the downtime window :D
[12:44:49] claime: okok makes sense
[12:47:58] elukey: do you have a task for the kartotherian work that I can link to?
[12:51:38] probably https://phabricator.wikimedia.org/T381565
[12:52:46] Cool thanks
[12:52:59] I'll go grab a bite to eat and write that up afterwards
[13:22:58] claime: for when you are back - at this point the msg could be more hopeful, with a note that we are testing the new stack, and if any issue is registered etc..
[13:31:34] ack
[13:31:42] meeting, then email
[14:00:36] serviceops, collaboration-services, Prod-Kubernetes, Data-Platform-SRE (2025.09.26 - 2025.10.17), and 3 others: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703#11229506 (Gehel)
[16:10:18] update on maps - after a chat with Yiannis we discovered that the tegola cache on swift is probably not up to date, so I am re-creating all the 90M tiles stored in there
[16:10:45] there is a process for it: I load events into kafka, and then a cronjob on k8s picks up the work
[16:13:56] IIUC the whole work starts at 10 UTC, which should be my midday, so I should have plenty of time to check the new tegola cache and see if we can repool codfw
[17:18:53] serviceops, MW-Interfaces-Team (MWI-Sprint-19 (2025-09-23 to 2025-10-07)), OKR-Work, Patch-For-Review: Execute test plan for rest gateway rerouting for rest.php requests and report findings - https://phabricator.wikimedia.org/T405368#11230433 (hnowlan) >>! In T405368#11211211, @aaron wrote: > I'm...
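The tile regeneration described at 16:10 (events loaded into kafka, a k8s cronjob picking up the work) can be sketched roughly as follows. Assuming, which the log does not confirm, that the tegola cache holds a full z/x/y Web Mercator pyramid, zooms 0 through 13 add up to 89,478,485 tiles, consistent with the ~90M figure. The generator batches tile coordinates into JSON payloads that stand in for the Kafka events; the max zoom, message schema and batch size are hypothetical, and neither the real producer nor the cronjob consumer is shown.

```python
import json
from typing import Iterator

# Sketch only: assumes a full Web Mercator (z/x/y) pyramid from zoom 0 up to
# a hypothetical MAX_ZOOM. The real max zoom, Kafka topic and message schema
# used for the tegola cache are not known here.
MAX_ZOOM = 13  # hypothetical; picked because it yields ~90M tiles

def tiles_at_zoom(z: int) -> int:
    """A full pyramid has 2^z x 2^z = 4^z tiles at zoom level z."""
    return 4 ** z

def total_tiles(max_zoom: int) -> int:
    """Sum of 4^z for z in [0, max_zoom], i.e. (4^(max_zoom+1) - 1) / 3."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))

def tile_events(max_zoom: int, batch_size: int = 1000) -> Iterator[bytes]:
    """Yield JSON-encoded batches of [z, x, y] coordinates to regenerate.

    Each payload stands in for one event a producer would send to Kafka;
    a consumer (e.g. a k8s cronjob) would render and store those tiles.
    The {"tiles": [[z, x, y], ...]} schema is hypothetical.
    """
    batch = []
    for z in range(max_zoom + 1):
        side = 2 ** z
        for x in range(side):
            for y in range(side):
                batch.append([z, x, y])
                if len(batch) == batch_size:
                    yield json.dumps({"tiles": batch}).encode()
                    batch = []
    if batch:  # flush the final partial batch
        yield json.dumps({"tiles": batch}).encode()

print(total_tiles(MAX_ZOOM))  # prints 89478485
```

Batching keeps the event count manageable: with 1000 tiles per event, ~90M tiles become roughly 90k messages rather than one message per tile.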