[06:25:24] andrew.bogott: glad to help. For the kubectl you probably want to look in the codfw1dev (I think it's called) namespace. Default won't have much in it
[07:39:52] good morning, looks like `bastion.wmcloud.org` does not respond to ssh anymore
[07:55:14] morning!
[07:55:43] hashar: works for me, can you share more details? (`ssh -vvv` or similar)
[07:56:32] dcaro: it works now! :)
[07:56:43] xd heisenbug!
[07:56:45] previously port 22 was not even answering
[07:57:08] thank you for the fix dcaro ! :b
[07:57:17] there's someone trying to do a user enumeration on ssh (just trying random users), might have triggered an IP block
[07:59:34] would it be from machines from my ISP?
[07:59:43] or maybe the system just denies everyone :)
[08:02:14] can't exploit any vulnerabilities if you can't access the service :-)
[08:21:14] xd
[08:55:11] topranks: could you please review this change? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/196 I will most likely split the patch into smaller chunks (for example, adding the IPv4-only network first)
[08:55:33] arturo: did you see https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/196#note_135995?
[08:55:34] so we can isolate failures when rolling out
[08:56:02] taavi: I did not, thanks for the review
[09:15:35] I got a large spam of Puppet failure emails for the integration projects which are quite confusing
[09:15:50] Last run log: NOTICE: Applied catalog in 9.26 seconds
[09:15:50] Exception: No exceptions happened
[09:15:55] so looks like hmm nothing failed? :)
[09:16:55] The last Puppet run was at Wed Apr 16 06:47:28 UTC 2025 (148 minutes ago).
[09:16:56] hmm
[09:18:08] ah, that is because the actual error is not captured somehow
[09:21:07] sorry for the noise!
[09:22:49] hashar: np, if you think of a better wording for it we can change it
[09:35:16] arturo: can I vote for this on irc or do I have to remember how gitlab works :P
[09:39:08] hmpf... we have some kind of admission loop: when rebooting lima-kilo, it got stuck failing to start registry-admission because kyverno is not up yet, and kyverno because registry-admission is not up yet xd
[09:39:38] registry-admission has a list of namespaces it does not run on for precisely that reason
[09:40:26] let me check if kyverno is there or not
[09:41:15] dhinus, dcaro: FYI spicerack v10.1.0 is out (no breaking changes compared to v10.0.0 ofc) but there might be things you'll find useful :) Changelog in the usual place ( https://doc.wikimedia.org/spicerack/master/release.html )
[09:46:27] it was not there https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/23
[09:46:34] volans: nice! thanks!
[09:57:09] this one is really nice! `log: notify the user on IRC when there is a cookbook waiting for input`
[09:58:44] +1 love the idea of IRC notifications
[09:58:53] dcaro: the registry admission should reject deployments when they are created on the API, no? for things that are already defined, the registry-admission won't evaluate them
[09:59:15] if kyverno was already deployed, how would registry-admission prevent it from running?
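As a rough illustration of the namespace-exclusion idea being discussed (the webhook name below is a guess, not the actual registry-admission or kyverno configuration in lima-kilo), this is one way to check which namespaces a validating webhook skips and what its failure policy is:

  # list the validating webhook configurations present in the cluster
  kubectl get validatingwebhookconfigurations
  # show each webhook's name, failure policy and namespace selector
  # (the configuration name 'registry-admission' is assumed here)
  kubectl get validatingwebhookconfiguration registry-admission \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.namespaceSelector}{"\n"}{end}'

A webhook with failurePolicy: Fail and no exclusion for the namespace where its own backing pods (or kyverno's) run is exactly what produces the circular dependency described above.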
[09:59:24] trying to restart a deployment ended up with 'denied by webhook'
[09:59:30] also, we don't have kyverno policies for all namespaces, only for `tools-whatever` namespaces
[09:59:41] because the webhook was down
[09:59:50] (this is after a reboot of the lima-kilo vm)
[10:00:21] so trying to validate the registry pods fails because kyverno is down, and trying to restart the kyverno deployment fails because registry-admission is down
[10:01:04] ok! I see now
[10:02:33] there's more stuff going on after the reboot, still figuring it out, so it might be that some previous step failed in a weird way that should not have happened, to end up there
[10:03:07] thx, the notify is behind a config flag, we'll try it and see how it goes, but it could be disabled easily
[10:03:32] dcaro: I +1'd the patch
[10:03:39] thanks!
[10:03:50] thanks for researching this
[10:04:02] I think lima-kilo reboots have been flaky for a while
[10:05:32] dhinus: to upgrade spicerack on cloudcumins, currently we just `apt install --upgrade` it?
[10:05:44] yep
[10:05:48] just `apt-get install spicerack` is enough
[10:05:48] yes
[10:05:59] yes, you don't even need --upgrade
[10:06:39] okok, I'll do it and test some
[10:06:44] (locally all is good)
[10:06:53] <3
[10:06:55] thanks
[10:09:08] everything went well :), it created the silence correctly and such, will merge the patch
[10:11:03] volans: thanks for the patches!
[10:11:29] great, sorry for the disruption (I really thought I had checked the wmcs-cookbooks repo before the release, dunno what happened)
[10:15:56] arturo: hey, I was looking at the static routes needed to support those new IPv4 ranges
[10:16:05] for the new vxlan-based networks
[10:16:26] ok
[10:16:30] I had an idea about how best to approach it, see here in Netbox for example:
[10:16:31] https://netbox.wikimedia.org/ipam/prefixes/83/prefixes/
[10:16:46] I have created two new 'container' prefixes, one for eqiad and one for codfw
[10:16:54] 172.16.0.0/17 and 172.16.128.0/17
[10:17:06] that's cool
[10:17:11] my feeling is to route these to the cloudgw in each case on the cloudsw
[10:17:29] right, so only 1 route needed on each
[10:17:54] the other option is obviously to add new specific routes for the 'flat-ipv4only' and 'flat-dualstack', adding to the one that's there for the legacy range
[10:18:00] yeah basically just so there is only one range/route
[10:18:12] and also so any future networks you add don't need any change in routing on the cloudsw side
[10:18:25] that sounds good to me
[10:18:27] (worried I'll forget it one day lol!)
[10:18:33] :-)
[10:18:34] ok I'll take a look
[10:19:15] dcaro: you can see an example of the notification right now in -operations btw
[10:19:25] 👀
[10:19:42] how does it know the nick name mappings? :-)
[10:20:35] it doesn't, too complex, IRC nicks are dynamic (_ on reconnections, etc...), it uses your shell name, which is something most people already highlight for, hopefully
[10:20:50] fair
[10:21:03] but I'll be sending a communication for various things and will mention this one too
[10:21:26] too bad logmsgbot can't do private messages
[10:26:11] :-)
[11:08:03] arturo: if you get a moment:
[11:08:04] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1136980
[11:12:00] Raymond_Ndibe: I'll update https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/96 with the 'not sending it when the user does not specify it' behavior we talked about
[11:12:24] (that way I can test the API changes on top of it)
[11:21:18] arturo: ok +1 for your patch, we need to merge my homer one before this to ensure it goes smoothly
[11:34:06] topranks: ack
[12:10:48] arturo: and all, I'm merging those on the eqiad CRs now
[12:10:57] topranks: ack
[12:10:58] will start with CR1 and check all looks ok
[12:14:50] ok
[12:19:31] arturo: that seems fine so far
[12:19:44] topranks: ok
[12:19:46] another quick review actually: pushing to the cloudsw side, there is some clean-up needed from last week still:
[12:19:47] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1136995
[12:20:01] those statics are already deleted, homer is trying to put them back
[12:20:09] hence the above patch
[12:20:31] topranks: ack, +1'd
[12:20:47] thanks <3
[12:22:46] topranks: this is the split from the earlier patch: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/198 if you +1 I will merge now
[12:25:23] thanks!
[12:25:27] arturo: did my best, not sure if I can give you the proper +1 though
[12:25:30] https://usercontent.irccloud-cdn.com/file/IDHbVOrq/image.png
[12:25:35] that's ok
[12:25:35] "approve" is greyed out?
[12:25:37] ok cool
[12:25:49] cteam: heads up, merging neutron network change in eqiad1
[12:25:57] btw I've another incoming for you, just another niggle of some cleanup from last week
[12:26:04] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1136996
[12:27:17] arturo: will that merge make the change live? I've only done cloudsw1-c8 and cr1, probably best to hold off till I do the other two
[12:27:19] I'll do them now
[12:27:43] topranks: ok, I'll revert. Yes, merging tries to make it live
[12:27:57] ok... no I don't think you need to revert
[12:28:10] I'll just get mine in now, should work via the single path I think
[12:28:16] ok
[12:28:32] arturo: ack
[12:28:51] dang, it failed
[12:28:52] {"NeutronError": {"type": "TunnelIdInUse", "message": "Unable to create the network. The tunnel ID 9 is in use.", "detail": ""}}
[12:28:53] fixing that
[12:33:51] topranks: the new neutron network was just created
[12:34:24] everything seems green, so I'll leave it like that for some time
[12:34:32] before introducing the dualstack one
[12:34:33] arturo: ack, both CRs are now receiving the new /17 aggregate too
[12:34:40] cool
[12:57:03] arturo: last related patch here:
[12:57:03] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1137002
[12:57:06] no rush on this one
[12:57:12] 👀
[12:57:38] +1'd
[12:58:09] thanks!
[12:58:09] * arturo food time
[13:15:57] andrewbogott: do you remember what the current status of the toolsbeta.org domain is?
[13:17:07] It's registered but I haven't added it to designate. I think the phab task is up to date.
[13:17:58] okok, I'm kinda eager to be able to just `ssh login.toolsbeta.org` xd
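For reference, a minimal sketch of what adding the registered domain to Designate could look like once someone gets to it; the zone email, the record name and the CNAME target below are assumptions for illustration, and the exact recordset syntax should be checked against the installed client:

  # create the zone in Designate (the email address is a placeholder)
  openstack zone create --email cloud-admin@example.org toolsbeta.org.
  # point login.toolsbeta.org at an existing bastion (the target name is a guess)
  openstack recordset create toolsbeta.org. login --type CNAME \
    --record login.toolsbeta.wmcloud.org.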
[13:44:30] andrewbogott: I might need your help with a weird Cinder issue: T392089
[13:44:30] T392089: [cinder] Volume failing to attach/detach - https://phabricator.wikimedia.org/T392089
[13:48:33] dhinus: I haven't tried, but most often you can get that unstuck with 'openstack attachment list' and 'openstack attachment delete' -- want to try or want me to just do it?
[13:50:07] tried that, it times out :(
[13:50:30] I can't find an error in logstash either
[13:51:00] that's new
[13:51:45] so it was initially "reserved" after the user tried to attach it, I tried "volume attachment complete" and that maybe made things worse :D
[13:52:09] because it's not "attached" but it's not showing in the VM, and I cannot detach it
[13:52:36] *it's NOW "attached" in the list, but it's not working
[13:53:46] arturo: FYI https://phabricator.wikimedia.org/T392094
[13:54:05] I meant to raise this a few weeks back when we added the routes for the v6 ranges
[13:54:16] it's not too important but best to tighten it up this way if we can
[13:56:08] 👀
[13:57:05] "ERROR cinder OSError: write error#0122025-04-16 13:49:55.773 1316013 ERROR cinder" that doesn't seem good
[13:57:30] andrewbogott: :-( network feels fine at the moment
[13:57:40] andrewbogott: hmmm where did you find that one?
[13:57:44] yeah, I don't think it's related to the network
[13:57:48] dhinus: logstash cinder logs
[13:58:09] I was searching using the volume ID and that one didn't come up
[13:58:17] what search string did you use?
[13:58:32] No search string, just looking at the latest messages
[13:58:38] I'm restarting cinder services now
[13:58:43] just in case!
[13:58:44] is there a "cinder" dashboard in logstash?
[13:59:26] I usually go to "OpenStack eqiad ECS" then filter with search
[14:00:03] yep, just edited 'OpenStack Services' to select cinder only
[14:00:23] So yeah, the scheduler is throwing that write error and not responding to the api server, hence the timeout.
[14:00:27] Not corrected by restarts.
[14:00:44] :(
[14:01:06] meeting time, I'll continue to investigate after
[14:01:17] did you confirm that cinder works for other volumes?
[14:04:22] I haven't tried other volumes, no
[14:25:01] I created/attached/detached in another project and it worked fine. So probably this is specific to that one volume.
[14:28:10] ok that's good
[14:29:51] happen to know if the data on that volume is of value?
[14:30:37] cristian just wanted to check if there was some data there that could be useful to resurrect the VMs that were deleted in the buster purge
[14:30:55] so _maybe_ there is data of value, but he's not really sure
[14:31:14] ok. I'll do my best to preserve it :)
[14:31:31] So it was attached to a VM that I deleted?
[14:32:20] also not sure, let me ask
[14:32:59] not a big deal, just wondering how we got there
[15:48:06] andrewbogott: T392116
[15:48:07] T392116: horizon: service account has role in project but it doesn't shows up in horizon - https://phabricator.wikimedia.org/T392116
[15:48:47] you're logged into horizon as the service user?
[15:49:02] yes
[15:49:47] did you shift-reload?
[15:49:47] I have tested the usual: page refresh, try on a different browser, a different private tab, etc.
[15:49:55] ok :)
[15:53:09] andrewbogott: it shows up now!
[15:53:20] maybe I was too impatient?
[15:53:27] dunno, I removed it and re-added it
[15:53:33] that sure shouldn't matter though
[15:54:55] thanks!
[15:55:27] will close the ticket
[16:02:24] can I get a +1 here? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/200
[16:04:09] lgtm
[17:55:21] Does anyone have time to look at quarry? T392107
[17:55:21] T392107: quarry.wmcloud.org: "This web service cannot be reached" - https://phabricator.wikimedia.org/T392107
[17:55:57] I'll look
[17:56:01] SD0001 says there are Redis failures in the logs
[17:56:34] * bd808 is not sure which logs
[18:01:50] this is all rolled up into a k8s deploy so it will take a bit for me to untangle
[18:02:44] thanks for looking andrewbogott. I'm going to lunch, but I can try to help later if you end up stuck or needing more hands.
[18:02:52] thx
[18:03:01] I can certainly redeploy, not sure what state is saved or wiped out by that
[18:03:04] probably safe
[18:22:30] taavi: bored on the train by chance?
[18:22:38] on a ferry!
[18:22:42] better yet!
[18:22:58] that feels like a loaded question.. what do you need?
[18:23:06] I mean, assuming you're on a train that is itself on a ferry
[18:23:23] taavi: quarry is broken, could use some pointers understanding what's happening.
[18:23:33] https://www.irccloud.com/pastebin/aZhzLQtz/
[18:23:53] I tried to kill one of those nonexistent web pods and the command hung
[18:24:18] this is a magnum cluster I guess?
[18:24:29] yeah
[18:24:30] where can I run kubectl against that?
[18:24:46] k8s is killing those web pods over and over because the health check fails. So I blame the network, somehow
[18:24:55] quarry-bastion.quarry.eqiad1.wikimedia.cloud
[18:25:22] export KUBECONFIG=/home/rook/quarry/tofu/kube.config
[18:25:30] https://phabricator.wikimedia.org/T392107#10749254 suggests that the disk is full, which would certainly do that
[18:25:43] hmm, apparently I am not a member of that project
[18:25:50] * taavi reaches for the yubikey to fix that
[18:25:57] ooh that would fit
[18:26:19] do we need a shell on the worker node to fix it?
[18:26:39] Could also tear down and re-deploy, assuming that's not disruptive
[18:26:44] give me a second
[18:27:11] ok!
[18:27:17] you can have several
[18:28:20] I think it is that, yep
[18:28:22] │ ReadonlyFilesystem False Wed, 16 Apr 2025 18:27:26 +0000 Thu, 06 Feb 2025 14:55:44 +0000 FilesystemIsNotReadOnly Filesystem is not read-only │
[18:28:33] wait no, that wording is confusing xd
[18:29:33] andrewbogott: how do I get a shell on the worker?
[18:29:54] I'm not sure it's possible, they aren't built with keys by default as far as I know
[18:29:59] hmmmm
[18:30:14] I was wondering the same thing xd
[18:30:16] hence 'tear down and rebuild'
[18:30:28] my guess is that those have filled up their disks
[18:30:33] could try rebooting them first
[18:30:35] If Rook is lurking they'll know for sure
[18:30:52] want me to soft reboot the workers?
[18:31:03] sure
[18:31:40] done
[18:33:30] quarry-127a-g4ndvpkr5sro-node-1 is the problematic node btw
[18:33:43] we can try draining and undraining
[18:33:54] not sure it will help in the long run though (or how long it will last)
[18:34:17] still rebooting
[18:34:19] that doesn't seem to have cheered up those pods at all
[18:35:21] this seems to get the disk usage data
[18:35:24] https://www.irccloud.com/pastebin/PmeTcOjQ/
[18:35:28] oh at least I got my old shell back
[18:35:59] that looks like lots of free space
[18:36:22] yep :/ they rebooted already right?
[18:36:35] openstack thinks they did
[18:36:52] where do you see that? kubernetes still sees everything there as ContainerCreating
[18:37:31] see what?
[18:37:50] the free disk space
[18:38:02] from the query, the command is in the paste too, ran on the bastion
[18:38:12] side note: why does quarry have so many web pods?
[18:38:19] kubectl get --raw "/api/v1/nodes/quarry-127a-g4ndvpkr5sro-node-1/proxy/stats/summary"
[18:38:41] right now it has 0 web pods :(
[18:38:46] still broken: Warning FailedCreatePodSandBox 2m12s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd container: failed to create prepare snapshot dir: failed to create temp dir: mkdir /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/new-1244890645: no space left on device
[18:39:24] I think calico got stuck
[18:39:44] calico was able to start on node 1
[18:39:46] I told kubernetes to delete all the pods in the quarry namespace, and that seems to have done something
[18:39:54] and now the others seem to be starting too
[18:39:57] quarry's back I think
[18:40:02] nice \o/
[18:40:05] I've never tried this before but I can add a third node to the cluster, then potentially we can kill the troubled one
[18:40:06] oh!
[18:40:09] hm
[18:40:19] that's what I started with (deleting troubled pods) and it lost me my shell
[18:40:28] looks happy
[18:40:50] web is responding again :)
[18:40:50] the reboot must've done something
[18:41:02] the command above still reports plenty of space
[18:41:25] so it seems the space issue was "solved" by the reboot, then the pod delete got them unstuck?
[18:41:29] not sure if it was the reboot or my `kubectl delete pod -n quarry --all` that actually fixed it
[18:41:35] Does k8s have some kind of backoff where after trying/failing to reschedule those pods a bunch it despaired?
[18:41:48] I don't think so
[18:42:06] I think it does, would have to double check, but iirc it even gives up for a while
[18:42:21] I think reboot+delete is the explanation. But why does a reboot free space exactly?
[18:42:37] it might clean up old container logs, unused images, ...
[18:43:00] anyway, I have a few follow-ups:
[18:43:00] * convert the redis deployment to be a statefulset and give it persistent storage
[18:43:00] * figure out why the web component has 8 replicas, and maybe reduce that to a more reasonable amount
[18:43:32] * T392138
[18:43:33] T392138: No alerting for quarry - https://phabricator.wikimedia.org/T392138
[18:43:40] We shouldn't have learned about this outage from a user
[18:43:47] 👍
[18:44:06] taavi: want me to open tasks for those follow-ups or did you already?
[18:47:17] btw, I've started putting the OKR report thingies here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_Workgroup/Reports#2025 , wdyt?
[18:47:44] andrewbogott: please do
[18:48:01] taavi: are you on one of those Scandinavian ferries that serve largely as a tax dodge for passengers? Is the whole ship one big cheap-whiskey-fueled brawl?
[18:48:37] dcaro: public progress reports are good!
[18:50:52] let me know if other places would be better, or similar, I'll try to keep it updated though it takes some time
[18:51:09] andrewbogott: yeah, this is one of those ships where "tax free" generally means "the prices are the same, but we get more profit"
[18:51:38] OK, so that's like air travel in the US
[18:51:43] T392143 T392141
[18:51:44] T392143: Quarry: Why so many web pods? - https://phabricator.wikimedia.org/T392143
[18:51:44] T392141: Update quarry redis deployment - https://phabricator.wikimedia.org/T392141
[18:54:26] are you following up on T392107 as well or should I?
[18:54:27] T392107: quarry.wmcloud.org: "This web service cannot be reached" - https://phabricator.wikimedia.org/T392107
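A rough sketch of how one might look at the state behind those follow-ups; the kubeconfig path is the one mentioned earlier in the log, but the deployment name used for scaling is an assumption, not verified against the actual quarry manifests:

  export KUBECONFIG=/home/rook/quarry/tofu/kube.config
  # how many replicas is each deployment configured with?
  kubectl get deployments -n quarry
  # does any node report disk pressure, since that is what broke the pods?
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
  # scaling the web component down would then be something like (name is a guess):
  kubectl scale deployment/web -n quarry --replicas=2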
[18:55:17] * dcaro off
[18:55:27] thanks dcaro
[18:55:35] we can close 2107, can't we?
[18:55:41] yeah
[18:55:43] thanks andrewbogott taavi for handling the outage!
[18:55:43] I will do that
[18:56:16] added a note for the team meeting too
[18:59:01] great
[19:00:05] partying is a thing that definitely happens on these ferries too, but the ferry companies are pretty good at isolating those people from those that just want to get from place A to place B
[19:01:48] That sounds nice for the partiers and the non-partiers alike.
[19:09:24] it's sometimes a bit funny how they do it.. for example it's officially forbidden to consume alcohol bought from the tax-free shop in the cabins, but there are also (paid) coolers on-board that can be used to cool those drinks
[19:11:50] I guess you're supposed to cool the drinks that you brought with you?
[19:12:25] that's also forbidden!
[19:12:56] welp
[19:13:55] (not that either of those bans usually stop people)
[22:05:45] andrewbogott: a horrible thing I realized today: Bullseye EOL is September. Have y'all started thinking about the migration push yet?
[22:18:13] Thinking, but not much beyond thinking
[22:26:15] It snuck up on me :/
[23:59:43] quarry has crashed again
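Until the alerting follow-up (T392138) lands, a crude availability probe is easy to run from anywhere, for example from a cron job; this is just a sketch, not part of any existing monitoring:

  # print DOWN (and exit via the || branch) if quarry does not answer within 10 seconds
  curl -sf -o /dev/null --max-time 10 https://quarry.wmcloud.org/ \
    && echo "quarry: up" || echo "quarry: DOWN"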