[10:48:22] federico3, tappof: can I merge your puppet changes?
[10:48:49] yes Raine, thx
[10:48:59] Tiziano Fogli: monitoring services: add migration task T367065 to instances (9aa5d078fb)
[10:48:59] Federico Ceratto: CAS: Add wmf group for Zarcillo, remove ops (a7afe805c2)
[10:49:00] T367065: Move profile::idp::client::httpd::site checks to Prometheus blackbox probes - https://phabricator.wikimedia.org/T367065
[10:49:03] ty tappof!
[10:49:03] * vgutierrez blocked on some puppet-merge....
[10:49:49] vgutierrez: me too :D
[10:49:51] federico3: is yours ok?
[10:52:31] marostegui, Amir1, jynus ^^ do you know if we can proceed with that Zarcillo CR?
[10:54:22] We're going to send it.
[10:54:27] federico3: ^
[10:54:30] We're in the middle of a cluster upgrade, codfw is wiped.
[10:54:36] revert if you don't get a reply
[10:55:38] It's a one-liner for a non-production app, we're sending it.
[10:55:42] +1 to revert
[10:55:52] O
[10:55:57] no idea, that's a dba thingy
[10:56:13] I am reverting
[10:56:18] marostegui: ack
[10:56:19] thx marostegui
[10:56:36] my CR will show up as soon as you run puppet-merge again, feel free to proceed please
[10:56:43] ack, thanks vgutierrez
[10:56:57] ty both
[10:56:58] merging
[10:57:00] merging
[10:57:43] yes, reverting, merging is the safer option, then re-reverting so it can be deployed at any time afterwards
[10:58:34] The revert is merged
[10:58:39] even if in the past some SREs had been mad about me doing that (after asking first and waiting, of course)
[11:05:17] https://grafana.wikimedia.org/goto/qHBY6AEHR?orgId=1
[11:05:27] Emperor: ^^ what's the recommended approach?
[11:05:46] I'm not seeing ant obvious traffic changes on superset
[11:05:49] *any
[11:06:33] I will check swift logs and logstash
[11:07:00] vgutierrez: what percentage of traffic is that, is it a small one?
[11:07:22] 125 rps? definitely a small one :)
[11:07:33] so I wouldn't go for the rolling restart yet
[11:07:38] but I am no expert
[11:09:05] Amir1: I am looking at logs on ms* hosts, you could check logstash, I don't know if swift logs there tho
[11:09:22] not sure either, I'm checking
[11:10:00] I wonder if it could be something like thumbor (as if it was the proxies, the rate would be higher), but I don't know how to debug that
[11:10:33] thumbor is receiving 0 rps in codfw
[11:10:37] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-2d&to=now-1m&timezone=utc&var-site=$__all&viewPanel=panel-22
[11:10:45] that's weird, isn't it, hnowlan ?
[11:10:51] I see this, something is 503ing
[11:10:59] we're depooled in codfw no?
[11:11:07] yes
[11:11:12] for the k8s cluster upgrade
[11:11:36] it's odd to see that not reflected in the envoy graphs
[11:11:44] Emperor: wdym?
[11:11:56] 10:46 is when it started
[11:12:08] https://grafana.wikimedia.org/goto/MxHmg1EHR?orgId=1
[11:12:37] swift-rw is pooled in codfw but not in eqiad
[11:12:43] https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?from=now-2d&orgId=1&to=now&var-datasource=000000005&var-destination=$__all&var-origin=swift&var-origin_instance=$__all&timezone=utc
[11:12:58] is that expected?
[11:13:19] Normally when the proxies are bust, that shows up in the envoy telemetry from swift
[11:13:56] Is anyone roll-restarting the codfw swift proxies yet? If not, that's the first call I'd suspect
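For reference, the roll-restart suggested here, if done by hand with cumin rather than via a cookbook, might look roughly like the sketch below; the host selector and the swift-proxy unit name are assumptions for illustration, not taken from this log.

    # Hedged sketch: batched restart of the codfw swift frontends, one host at a
    # time with a pause between batches so the proxies drain gracefully.
    sudo cumin -b 1 -s 30 'ms-fe2*.codfw.wmnet' 'systemctl restart swift-proxy.service'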
[11:13:58] hnowlan: ah thought you meant the fact that codfw k8s was depooled
[11:14:06] s/hnowlan/Emperor/
[11:14:07] Emperor: no
[11:14:13] I'll do it, then
[11:14:17] ok, thanks
[11:14:41] Emperor: swift-rw being pooled in codfw seems wrong, no?
[11:14:58] for future reference, it's always OK to do that as a first step when swift looks unhappy.
[11:15:04] gotcha
[11:15:10] thumbnails have always been rw?
[11:15:11] I was hesitating to do it
[11:17:09] hnowlan: swift-rw is always pooled in one DC, but the swift-ro and swift-rw were part of a partially-completed change from before I started and now largely serve to spread confusion
[11:17:24] In practice MW will always try and write to both swift clusters for every upload
[11:18:47] roll-restart of codfw ms proxies complete
[11:19:03] thanks
[11:20:28] So, stating the obvious, 503s are still high, right? or would it take some time?
[11:20:29] it's not recovering
[11:20:44] envoy graphs are showing a sustained rise in downstream latency highest-percentile, which might suggest that there's been a lot of requests for very large objects
[11:21:11] I will then keep looking at superset
[11:22:12] we didn't experience any kind of traffic spike on the CDN
[11:22:29] upload@codfw,ulsfo,eqsin are stable in terms of number of requests and ongoing BW
[11:22:35] s/ongoing/outbound/
[11:25:04] I checked logs from one fe and one be, nothing stands out
[11:26:59] 503s all look like they're thumbs
[11:28:28] I ran out of ideas
[11:28:50] here is my theory: if wikikube is being reimaged/wiped, etc., thumbor can't be accessed, and since thumbnails hit thumbor if they are not in swift, we are getting 500s
[11:28:55] should thumbor be depooled in codfw for the cluster work given that it's running on k8s?
[11:29:09] it's depooled AFAIK
[11:29:18] I suggest depooling codfw from upload too
[11:29:18] or at least it isn't reporting any traffic
[11:29:32] thumbor@eqiad seems to be struggling: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=2025-06-23T05:28:35.629Z&to=2025-06-23T11:28:35.629Z&timezone=utc&var-quantile=0.75&var-engine=$__all&refresh=1m&viewPanel=panel-103
[11:29:37] vgutierrez: is upload also depooled?
[11:29:42] hnowlan: we're wiping the wikikube cluster
[11:29:42] Amir1: no
[11:29:56] hnowlan: so yeah it's better that thumbor is depooled :D
[11:29:56] why should it be? :)
[11:29:56] that should be, otherwise missing thumbs will hit thumbor in k8s
[11:30:21] am I missing something obvious?
[11:30:25] Amir1: wait, what are you calling upload here?
[11:30:30] upload.wm.o?
[11:30:33] upload.wikimedia.org
[11:30:38] vgutierrez: thumbor eqiad is doing okay by other metrics, that spike isn't too bad
[11:30:43] hnowlan: ack
[11:30:59] Amir1: so that's the CDN read only cluster for swift and maps
[11:31:10] I'm going to depool the thumbor service, lvs still lists it as pooled
[11:31:18] swift has a 404 handler that hits thumbor
[11:31:29] it hits the discovery record for thumbor
[11:31:31] Amir1: so you need to depool thumbor@codfw on swift
[11:31:45] iirc
[11:31:45] hugh is depooling it
[11:31:51] it'd hit the remote dc but yeah, better than 500ing
[11:31:53] if that's not feasible you need to depool swift@codfw
[11:32:00] but not upload
[11:33:18] thumbor is still pooled in codfw wtf
[11:33:20] why
[11:33:44] hnowlan: are you depooling it right now?
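A sketch of what checking and then depooling thumbor in codfw at the conftool layer might look like; the selector tags below are illustrative assumptions, since the log does not show the exact conftool object layout for thumbor.

    # Inspect the pooled state first, then depool; the dc/service tags are assumed.
    sudo confctl select 'dc=codfw,service=thumbor' get
    sudo confctl select 'dc=codfw,service=thumbor' set/pooled=no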
[11:33:55] that's good, it seems to point to the core issue, and that is fisable :-)
[11:33:59] *fixable
[11:34:02] I'm struggling with the confctl syntax
[11:34:09] if someone knows it faster than I do please go ahead
[11:34:25] ok i'll do it
[11:34:34] ^ hnowlan
[11:34:38] done
[11:34:42] thanks
[11:34:55] waiting for metrics...
[11:35:04] Making a note that thumbor isn't included in the kubernetes services for some reason
[11:36:56] opened https://phabricator.wikimedia.org/T397618
[11:37:15] hieradata/codfw/profile/swift/proxy.yaml:profile::swift::proxy::thumborhost: 'thumbor.svc.codfw.wmnet:8800'
[11:37:20] I don't see any improvements yet, maybe the graphs have some lag
[11:37:30] do we need to change that?
[11:37:42] sigh
[11:37:58] no change @ 36:30
[11:38:09] swift is using thumbor.svc.codfw.wmnet. apparently
[11:38:16] why :'(
[11:38:17] vgutierrez: yes but the impact will be quite a slowdown honestly, I'm not sure if ssl is set up
[11:38:18] nor at :37
[11:38:18] vgutierrez: wow...that's ...suboptimal
[11:38:26] why is it hardcoding the thumbor backend
[11:39:29] is someone amending that?
[11:39:43] if I had to guess I think there were historical assumptions about thumbor and swift being depooled as one
[11:39:52] Emperor: do you have some context on that?
[11:40:56] so thumbor does not have a discovery record actually
[11:41:03] That should probably change
[11:41:08] claime: so what did you depool?
[11:41:26] for the immediate action. There are three options: Switch swift in codfw to hit thumbor in eqiad, or put up thumbor in codfw via some magic, or depool swift@codfw
[11:41:51] vgutierrez: It's in conftool, but a git grep of thumbor in operations/dns does not show either a metafo or dyna record
[11:41:53] which one do people think they can do the fastest without causing more issues?
[11:41:58] ok that patch is 5 years old, many things were different back then
[11:42:07] if safe I'd say do 3
[11:42:14] I vote for depooling codfw
[11:42:28] we can't put up thumbor in codfw, the cluster is wiped
[11:42:44] so #1 or #3, what is it Emperor?
[11:43:02] are you ok with swift@eqiad handling the load of the whole site?
[11:43:11] We've never done #1 before, I'm worried about the implications honestly
[11:43:21] unknown unknowns
[11:43:28] we do #3 on each dc switchover, right?
[11:43:34] yes
[11:43:35] jayme: yes
[11:43:40] urg.. *jynus :D
[11:43:41] sorry
[11:43:51] worst case, we bump thumbor in eqiad
[11:43:58] agreed
[11:44:03] we were on eqiad for a week in march, we didn't have any major issues iirc
[11:44:04] So, I think the known one is better, but would like an ok from Emperor
[11:44:50] this is user-facing, I think we should move forward
[11:45:00] yes
[11:45:02] I think this also, even if not too impacting, deserves to be a full-blown incident
[11:45:02] yes, let's depool swift@codfw please
[11:45:16] who is doing it?
[11:45:19] Then we should fix thumbor's discovery record and point swift to it
[11:45:20] sorry, my IRC client has stopped highlighting this channel(!)
[11:45:21] as it will have an interesting followup
[11:45:22] Amir1, effie could you track this as a proper incident? thx
[11:45:40] sorry, what are 1 and 3 here?
[11:45:51] > for the immediate action. There are three options: Switch swift in codfw to hit thumbor in eqiad, or put up thumbor in codfw via some magic, or depool swift@codfw
[11:45:56] Emperor: 1 is pointing swift@codfw to thumbor@eqiad, option 3 is depooling swift@codfw
[11:46:12] I would rather not 1)
[11:46:14] option 2 is not feasible at the moment
[11:46:18] which I think means let's do 3
[11:46:31] thanks, that was also I think the majority suggestion :-D
[11:46:41] back in 2 ticks
[11:46:46] who is doing it?
[11:46:48] I will create an incident on the status page
[11:46:54] even if not a major one
[11:47:02] I am depooling swift from codfw
[11:47:03] ^ effie Amir1
[11:47:08] Thanks effie
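For the record, depooling swift from codfw at the discovery layer (option 3) would look roughly like this sketch, assuming the standard confctl discovery syntax described on the Wikitech DNS/Discovery page:

    # Depool the swift discovery record in codfw so swift.discovery.wmnet only
    # resolves to eqiad, then double-check what is still pooled.
    sudo confctl --object-type discovery select 'dnsdisc=swift,name=codfw' set/pooled=false
    sudo confctl --object-type discovery select 'dnsdisc=swift' get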
[11:47:31] I'll try to create a template, the bot is in _security
[11:47:56] thumbnail generation is, as alluded to above, largely done via the custom 404 handler inside swift (in the wmf-specific middleware)
[11:48:19] affected users will be codfw, ulsfo and eqsin, right?
[11:48:23] not magru
[11:48:42] jynus: yes
[11:48:45] thanks
[11:48:46] https://docs.google.com/document/d/1Rj6ch36csI0kE7jnJftuqyoB1pQ7omwhOSbm8Q8pfNI/edit?tab=t.0
[11:48:58] Just made a copy
[11:49:02] no changes yet
[11:49:06] we should be happy it hit now and not at peak US traffic
[11:49:20] it had way less impact
[11:49:51] the miss rate in swift is pretty low
[11:49:52] btw the reason we never encountered this before is because when we do the switchover, thumbor and swift get depooled in relatively quick succession
[11:50:14] meaning we don't have swift pooled and thumbor depooled in the same dc basically ever
[11:50:40] https://www.wikimediastatus.net/
[11:50:49] "Trouble with some images"
[11:50:56] since forever, that rewriting constructs a thumbor URL based on "thumborhost" in the swift config, which is templated out by puppet to be thumbor.svc.DC.wmnet:8800 - we used to also use the other DC's thumbor for the "lob thumbnails at the other DC" thing that we no longer do
[11:51:28] and just to be sure, swift-rw and swift-ro do nothing?
[11:52:03] hnowlan: my notes from 2022 say 'the swift dnsdisc record is what the frontend caches use; we think the swift-rw and swift-ro records were for some other applications that never got written.'
[11:52:29] If the problem is "there is no thumbor in codfw", then depooling swift in codfw is an incomplete fix.
[11:52:30] we really have to get rid of those, I've lost count of how often they trip me up
[11:53:08] (since IIRC MW tries to pre-generate stock thumbnails for new uploads, so I would presume it'll continue to try to do so in both DCs regardless of pooled status)
[11:53:22] ok I depooled -rw too then
[11:53:23] should we depool swift-ro@codfw just in case?
[11:53:42] I think so too
[11:53:45] swift-rw is now pooled nowhere just fwiw
[11:54:09] whot
[11:54:21] 11:12 < hnowlan> swift-rw is pooled in codfw but not in eqiad
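A quick, hedged way to see where the confusingly-named swift discovery records are pooled; the dnsdisc names come from the discussion itself, but the exact object names are assumptions.

    # Print the pooled state of each swift-related discovery object, per DC.
    for svc in swift swift-ro swift-rw; do
        sudo confctl --object-type discovery select "dnsdisc=${svc}" get
    done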
[11:54:30] Emperor: shall I pool it on eqiad?
[11:54:37] seeing some recovery
[11:54:38] 503s on the CDN are recovering
[11:54:39] I think so
[11:55:01] yes, but we have to understand what -rw being pooled nowhere means
[11:55:01] rps up on thumbor eqiad
[11:55:09] swift graphs still look horrible in codfw
[11:55:37] not that it matters anymore :)
[11:55:41] the other thing that I guess minimized errors is the existing cache
[11:56:02] I am trying to see all the good things in this
[11:56:12] https://codesearch.wmcloud.org/search/?q=swift-rw&files=&excludeFiles=&repos= this is not helping much either
[11:56:27] I'm still concerned that uploads will be impacted - Amir1 am I right about thumbnail pre-gen?
[11:56:28] I think we need to do ro too, I don't think there is any way to distinguish ro and rw in these cases at cdn levels
[11:56:56] Amir1: :? reads happen via the upload cluster, writes via the text cluster
[11:57:09] swift pag.e resolved
[11:57:11] Emperor: that should hit only in eqiad, I don't think we do pre-gen in codfw when it's secondary dc
[11:57:15] Amir1: I will do so too
[11:57:27] Amir1: ah, OK, I hadn't understood that nuance
[11:57:28] depooling swift-ro from codfw
[11:57:59] (or had forgotten it)
[11:58:19] zero now
[11:58:23] an actionable/warning for later: wikimediastatus.net cache is terrible and causing outdated info to be produced
[11:58:24] Thanks <3
[11:58:28] gotta go now, happy to discuss this in our next incident review meeting
[11:58:42] I will add it to the doc for awareness
[11:58:44] until we get rid of/clean it up we should have an SOP for the various swift services
[11:59:01] it would be great if people help me fill the incident doc https://docs.google.com/document/d/1Rj6ch36csI0kE7jnJftuqyoB1pQ7omwhOSbm8Q8pfNI/edit?tab=t.0
[11:59:32] I will, but also have to go
[11:59:57] should I change status to monitoring, or too early?
[12:00:31] I'd say let's change it
[12:00:34] sgtm
[12:00:43] thumbor eqiad is happy with the added traffic
[12:01:16] I did it, but handing over the final resolution of the status page to whoever Amir or effie decide
[12:01:24] when it is ready
[12:01:33] have to go now, will help fill in the doc later
[12:03:25] I'll let it be like this for ten minutes and then I'll resolve it
[12:05:30] it seems that we don't understand why, but recovery came after depooling -rw if I am reading correctly
[12:07:32] https://grafana.wikimedia.org/goto/dfLIWJPHg?orgId=1
[12:07:50] trafficserver_backend_requests_seconds_count ^ per dc
[12:09:59] at about 11:50 the request rate to codfw dropped to ~0
[12:11:46] ok then it is all ok
[12:12:31] Emperor: would you please paste the relevant graph on the doc?
[12:13:08] Maybe we should investigate whether swift@ro is actually doing anything
[12:14:24] since after the depool of rw, it went to zero
[12:15:23] we've previously left swift-rw and swift-ro alone (cf my note from 2022 above) because of a lack of a clear idea of exactly what their original intended use was, what might be relying upon them, and the complexity of getting rid of them.
[12:15:41] Which is not to say we shouldn't revisit that question.
[12:15:59] effie: ack
[12:16:11] I resolved the incident
[12:16:16] Thanks everyone!
[13:37:06] akosiaris: when you or someone else is free, could you check out this ticket for me? I need an update to site.pp before i can finish installing the servers. https://phabricator.wikimedia.org/T393121
[13:37:28] JennH: yup, looking into it.
[13:38:40] ty!
[13:59:20] JennH: you should be good to go
[14:00:29] Awesome ty. Should get those to you today or tomorrow
[14:11:10] Emperor: thumbor in codfw is pooled again, so you should be good to go repooling swift
[14:14:30] We have repooled all services except for everything behind ingress, because machinetranslation is broken and undeployable, and we can't depool single services behind ingress
[14:14:35] Upgrade is done
[14:25:03] I have added some stuff to the incident doc in good faith, but please people review it, some of it may be a misunderstanding on my side
[14:36:10] I will repool swift codfw then
[14:36:25] effie: done
[14:37:25] ah! sorry! confused this channel with -operations :/
[14:37:51] If no one objects, I will do another sessionstore reimage (sessionstore2005) shortly (no impact expected)
[15:41:23] anyone encountered this during a reimage?
[15:41:25] https://usercontent.irccloud-cdn.com/file/st0pB6Am/image.png
[15:42:34] doesn't ring a bell
[15:42:51] but I blame partman :D
[15:43:49] :) I mean vgdisplay could tell you something but if you didn't tinker around with it, shouldn't the errant volume group already have been removed as part of the reinstall?
[15:44:19] so.... vgdisplay does show a conflicting vg (vg0)
[15:44:26] not during a reimage but in other (personal) situations: I usually fix it by spawning a shell and completely removing the lv setup and starting from scratch
[15:44:27] and it's made up of all the disks
[15:44:37] but in this case probably the cause is different
[15:44:48] fabfur: so... it's leftover lvm metadata?
[15:44:55] yes
[15:45:21] it usually happens when a partitioning run stops after the lvm setup has completed
[15:45:23] ok, that kinda makes sense, because there are a bunch of drives that have yet to be used, and they are pvs of this vg
[15:49:53] try to remove it with `lvremove` / `vgremove`
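Spelled out, the cleanup being suggested could look roughly like the following from a shell in the installer environment; the vg0 name matches the conflict above, but the device names are examples, and all of it is destructive, so it is only for a host being reimaged anyway.

    # Deactivate and remove the leftover LVM stack, then wipe the PV labels.
    vgchange -an vg0              # deactivate any active LVs in vg0
    lvremove -ff vg0              # remove all logical volumes in vg0
    vgremove -ff vg0              # remove the volume group itself
    pvremove -ff /dev/sd[a-d]1    # adjust devices to match pvs output
    wipefs -a /dev/sd[a-d]        # optional, if wipefs is available in d-i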
[15:50:04] fabfur: yeah, I did, trying again
[15:50:11] 🤞
[15:50:20] 🤞
[15:51:23] it violates the principle of least surprise (for me) that the installer wouldn't just overwrite any existing lv/vg/pv info... but this makes sense, since it's coming from *somewhere* :)
[15:51:38] confidence is high!
[15:52:00] that bugged me too in the past, apparently once you set up lvm, it's forever :)
[15:53:56] FWIW the relevant partman bits are in place in the standard recipes to ask partman to wipe lvm info (partman-lvm/device_remove_lvm) plus partman-lvm/confirm and partman-lvm/confirm_nooverwrite
[15:54:02] at least as far as I can tell
[15:54:42] yeah (and while not an expert), hence I was surprised to see this come up during a fresh install (again, assuming no other changes were made during d-i itself).
[16:02:00] fabfur: it worked btw; thanks!
[16:02:13] great
[16:03:44] godog: this is a custom recipe, but it's a fork of standard-efi.cfg, and those options are there
[16:04:05] cue sadness!
[16:04:37] ah
[16:04:40] for real
[16:21:00] ok, now I am very confused.
[16:21:44] after removing all lv/vg/pv and reinstalling, it succeeded... but it wouldn't boot, because it couldn't find the vg
[16:22:08] so for giggles, I reran the installer
[16:22:25] and now I'm back at the conflicting vg0 screen
[16:22:31] conflicting vg name
[16:22:41] and, they are all there... the right ones
[16:22:48] right == expected
[16:27:08] godog: does the order of those bits matter?
[16:32:07] I'll pool the wikikube ingress codfw again, machinetranslation in codfw is "fixed", cc jayme, claime, raine - T397148
[16:32:10] T397148: Update wikikube codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T397148
[16:40:10] i was reviewing https://wikitech.wikimedia.org/wiki/Conftool but it's not clear, what's the process to shift traffic from one DC to another? Do you just depool the entire DC and it does the rest as appropriate?
[16:40:24] assuming traffic flows through .discovery.wmnet
[16:41:41] we depool each service's entry in conftool for the datacentre we're working on
[16:43:45] ebernhardson: yes and what claime said. then the discovery.wmnet part is the one that returns the active DC
[16:44:04] hmm, i guess that's not documented in the Conftool page? I'm trying to put together a plan to test our recent switch to using discovery dns
[16:44:41] (first pool the new DC and then depool the other one basically)
[16:44:51] ebernhardson: perhaps we can refer to it from there but it should be covered in https://wikitech.wikimedia.org/wiki/DNS/Discovery
[16:45:03] ahh ok, thanks!
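Putting the answers above together, shifting one discovery-backed service from codfw to eqiad could look roughly like this sketch, where 'myservice' is a placeholder and DNS TTLs mean the cutover is not instant:

    # Pool the new DC first, then depool the old one, then confirm resolution.
    sudo confctl --object-type discovery select 'dnsdisc=myservice,name=eqiad' set/pooled=true
    sudo confctl --object-type discovery select 'dnsdisc=myservice,name=codfw' set/pooled=false
    dig +short myservice.discovery.wmnet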
[17:27:49] JennH: re: https://phabricator.wikimedia.org/T393015, I doubt you'll be able to proceed much using current tooling, we'll need to look into this host from the tooling side. Just make sure we can have remote access and we'll take it from there (and you'll be able to close the task).
[17:29:43] cool. This one is reachable on mgmt. I got a few other test servers that i might need to just toss your way
[17:30:34] cool.
[17:38:34] akosiaris: this one is all to be discovered I guess :D
[17:41:05] yup
[17:41:08] fun fun fun
[17:41:28] I'll say I did not expect to get bash on the mgmt interface
[17:41:38] eheheh
[17:42:02] there is a bmcweb_persistent_data.json in the home folder once you log in
[17:42:09] and I got too spoiled and tried to jq it
[17:42:16] asked for too much I guess
[17:42:16] rotfl
[17:46:46] cool, thanks jelto