[01:49:41] vola.ns, something is screwed up with live migration in codfw1dev (the cloudvirtxxxx-dev hosts) which I need to investigate. The cookbook should work fine for the other remaining cloudvirts.
[06:08:51] greetings
[07:10:54] ok I got this out https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81
[07:11:18] pipelines are getting timeouts when fetching from github, not sure yet if the issue is on our side
[07:11:37] e.g. https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/624235
[07:17:01] that's unfortunately a rather recurring issue
[07:17:14] I tried to fix that with T403028 but that did not work as well as I hoped
[07:17:15] T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere - https://phabricator.wikimedia.org/T403028
[07:18:14] ack, thank you for the context
[07:19:25] do you know, or could you find out, if it is gh timing out kinda on purpose?
[07:21:55] b.d808's theory is that it's some sort of throttling against our NAT egress address. although I am a bit sceptical since I suspect we'd have heard from someone else if they were hitting the same issue
[07:23:17] yeah it is odd, I'm failing to reproduce it so far from a random cloudvps vm
[07:24:09] anyways I can see the rabbit hole from here, not today
[07:42:40] morning
[07:56:12] ok so I'll go ahead and merge https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 then the apply will happen automatically (?)
[07:59:49] iirc it will ask for your input before applying (in one of those gitlab manual steps)
[07:59:54] "manually run the last step in the main branch pipeline: tofu apply"
[08:00:01] from the readme
[08:00:22] ah! got it, thank you, I missed that part
[08:00:24] will do now
[08:00:59] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/136220
[08:02:57] thanks andrewb.ogott, I'll keep going through the list
[08:07:00] mmhh things are not going as expected, a new volume is being created, debugging
[08:08:29] oh, I think toolsbeta might have gotten stuck
[08:08:39] of course volumes are also managed by opentofu in toolsbeta
[08:08:43] yes that'd be me
[08:08:51] ack
[08:09:01] let me know if I can help
[08:10:00] ok will do! currently trying to understand what happened
[08:12:03] so the 10gb volume toolsbeta-nfs was disconnected from toolsbeta-nfs-3.toolsbeta.eqiad1.wikimedia.cloud by the cookbook as expected, though it is now in status "reserved"
[08:12:22] and the cookbook/openstack refused to attach it to toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud
[08:12:37] BadRequestException: 400: Client Error for url: https://openstack.eqiad1.wikimediacloud.org:28774/v2.1/servers/19c9ecd1-6fb2-4a2d-954a-c1dc6c956034/os-volume_attachments, Invalid input received: Invalid volume: Volume 648504db-18c2-4cee-b731-567dcb4dadf6 status must be available or downloading to reserve, but the current status is reserved. (HTTP 400) (Request-ID:
[08:12:43] req-62a05c8b-e50e-42f0-84de-44c00c8773b9)
[08:13:58] the internet™ suggests openstack volume set --state available
[08:14:17] no idea why it got in status reserved in the first place though
[08:15:24] no idea, looking
[08:15:47] why are there two toolsbeta-nfs volumes?
[08:16:45] taavi: no idea
[08:17:19] the other one (4gb) is ab8ddafd-f8f9-41e3-8a79-d985e962a2ee and description "Ignore this one, not sure where it came from"
[08:17:43] anyways I'll try to force the volume into state available
[08:18:38] morning
[08:19:49] of course the cookbook expects things in a certain state now, so re-running won't work
[08:20:39] ok I'll do things manually instead
[08:22:28] I think I've seen the "status is reserved" before but it's quite rare, not sure what caused it
[08:24:00] * godog nods
[08:24:30] ok the cookbook bits I've done manually, namely reattach the volume to the new host and flip 'profile::wmcs::nfs::standalone::cinder_attached'
[08:24:33] it rings a bell too, I think there was a way to track it on the DB, with reservations and such
[08:24:43] now to get https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/136220 to succeed
[08:25:06] I'm open to ideas, currently failing due to timeout while installing the openstack provider per T403028
[08:25:07] T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere - https://phabricator.wikimedia.org/T403028
[08:25:59] hmm I've never seen it fail so consistently 4 times in a row
[08:27:01] usually a single retry would work
[08:27:33] no idea what's the best next action tbh
[08:29:00] Just One More Retry :D https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/624360
[08:29:09] lol
[08:29:10] from the worker it ran on, curl does not complain, though the jwt token expired
[08:29:27] ok I'll manually apply the last step
[08:30:50] and it failed with the timeout :(
[08:31:08] what a bunch of BS
[08:31:39] I guess retry again, at least for toolsbeta
[08:33:17] ok toolsbeta apply worked
[08:34:56] waiting for dns to propagate then I'll start puppet on the nfs servers again
[08:40:45] ok done, rebooting the nfs workers in toolsbeta
[08:41:49] starting from the bastion actually
[08:42:49] I can login :)
[08:43:15] maybe a bit early xd
[08:44:14] oh ok I couldn't login actually
[08:46:18] `Could not chdir to home directory /home/dcaro: No such file or directory`
[08:46:33] did it reboot?
[08:46:52] https://www.irccloud.com/pastebin/HftrsnyB/
[08:47:06] not yet, I'll stop spamming, let me know when I can test
[08:47:26] yes it did reboot, not sure why homes are not there
[08:47:50] oh okok, looking
[08:48:17] stupid question, is there a way, when the cookbook calls wmcs-drain-hypervisor.py, to tell it to avoid the hosts that will be rebooted next? In most cases it uses the hosts just rebooted because those are empty, but I got a couple of VMs moving to the next hosts to be rebooted
[08:49:06] homes are mounted, maybe they are empty?
[08:49:07] toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud:/srv/toolsbeta/misc/shared/toolsbeta/home on /mnt/nfs/nfs-01-toolsbeta-home type nfs4 (rw,noatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.18.135,local_lock=none,addr=172.16.18.119)
[08:49:37] so that path on toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud does have data
[08:50:50] dcaro: are you testing stuff on the bastion? I don't want to step on each other's toes
[08:51:11] I was looking on the nfs server
[08:51:26] yep it has data, so maybe not mounted from the right nfs server?
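The manual recovery described above, forcing the stuck volume out of "reserved" and reattaching it to the new server, boils down to two openstack CLI calls. A rough sketch using the volume ID and hostname from the log; admin credentials scoped to the project are assumed, and exact syntax may vary between client versions:

    # force the cinder volume out of the stuck "reserved" state (admin-only flag)
    openstack volume set --state available 648504db-18c2-4cee-b731-567dcb4dadf6
    # reattach it to the replacement NFS server
    openstack server add volume toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud 648504db-18c2-4cee-b731-567dcb4dadf6
    # then flip profile::wmcs::nfs::standalone::cinder_attached and run puppet, as mentioned above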
[08:51:57] addr=172.16.18.119, and that's configured in nfs-4
[08:52:17] godog: I'm only running "non modifying" commands, you do any modification ones :)
[08:52:27] lol ok thx dcaro
[08:52:53] https://www.irccloud.com/pastebin/mc6D1kYs/
[08:52:58] that looks ok
[08:53:18] oh, now I see stuff
[08:53:25] yeah now it is fine, there might have been a race with the exports and initially /dev/sdb was not mounted / in fstab
[08:53:50] ok proceeding with nfs workers reboot
[08:54:00] so did you remount or something?
[08:54:32] I did remount it yes
[08:54:36] ack
[08:55:28] other than the nfs workers and the bastion, how do I find out what else has nfs stuck?
[08:55:51] volans: I don't think it has any logic for that :/, it's been a while since I used it though
[08:57:06] godog: you can try modifying the 'processes in D state' there to catch all VMs in the project https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&var-cluster=tools
[08:57:24] hmm, I think it might already be doing that
[08:57:52] good point dcaro, yes that has the whole list afaics
[08:57:58] ack thx
[09:00:40] ok next roadblock: cloudcumin1001 is failing to talk to https://prometheus.svc.beta.toolforge.org/tools/ and thus wmcs.toolforge.k8s.reboot_stuck_workers doesn't work
[09:01:06] I'm assuming some firewalling of some kind since packets seem to be dropped
[09:01:15] taavi: does that ring a bell ^
[09:01:17] ?
[09:02:22] you can also use the regular reboot_workers, passing the list, to get unblocked
[09:02:51] true, ok I'll do that for now
[09:14:12] for the prometheus access, it seems we will need to set `root@cloudcumin1001:~# export https_proxy=http://webproxy:8080`, with that it works
[09:14:59] it could just use spicerack's proxy info?
[09:15:10] mmhh am I misremembering running wmcs.toolforge.k8s.reboot_stuck_workers from cloudcumin1001 in the past with no additional settings?
[09:15:21] maybe
[09:15:43] (as in I've run it, but I'm not certain it was from cloudcumin or my local)
[09:16:07] we have both spicerack.http_proxy (plain) and spicerack.requests_proxies (preformatted for requests)
[09:16:18] what does that do?
[09:17:05] >>> spicerack.http_proxy
[09:17:05] 'http://webproxy.eqiad.wmnet:8080'
[09:17:09] >>> spicerack.requests_proxies
[09:17:09] {'http': 'http://webproxy.eqiad.wmnet:8080', 'https': 'http://webproxy.eqiad.wmnet:8080'}
[09:17:43] and you can pass it directly to requests.get(...., proxies=spicerack.requests_proxies)
[09:17:55] for example
[09:18:09] is it set in cloudcumin?
[09:18:14] where does it get the info from?
[09:18:20] yes, puppet
[09:18:28] this was spicerack-shell from cloudcumin1001
[09:18:51] https://www.irccloud.com/pastebin/6OFTHzxN/
[09:18:53] it gets pushed by puppet to /etc/spicerack/config.yaml
[09:18:56] config yep
[09:19:16] sounds like a good option yep
[09:19:38] I also remember we could call prometheus from cloudcumin without problems
[09:19:49] so I'm not sure why it's not working now
[09:27:29] the downtime of the safe_reboot cookbook has some problems: of the 3 hosts that I've done, 2 alerted in the -feed channel
[09:32:57] was it an icinga or AM alert?
[09:33:41] I think we cannot downtime in icinga so those ones will still trigger. we decided it was fine because... "we're gonna deprecate icinga" LOL
[09:33:54] why can't we downtime in icinga?
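A minimal sketch of the manual check discussed above, reusing the proxy address volans showed; the /api/v1/query path under the /tools/ prefix is an assumption about how that Prometheus is exposed:

    # from cloudcumin1001, reach the Toolforge beta Prometheus via the web proxy
    https_proxy=http://webproxy.eqiad.wmnet:8080 \
      curl -sS 'https://prometheus.svc.beta.toolforge.org/tools/api/v1/query' \
        --data-urlencode 'query=up'

Inside a cookbook the same request would simply pass proxies=spicerack.requests_proxies to requests.get(), as noted above.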
[09:34:03] cloudcumin cannot ssh to the icina host
[09:34:07] *icinga
[09:34:18] (IIRC)
[09:34:22] "Host cloudvirt1062 is DOWN" should be icinga, but also "FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent", that should be AM
[09:34:24] there's a couple yep, unable to get to nrpe
[09:34:27] ah right
[09:34:58] they have a silence also, but it expired
[09:35:27] yes, but from a quick glance at the cookbook it sets it to 30m and it took less than that, I'll dig later maybe
[09:35:42] godog: puppet is failing in the toolsbeta bastion, with the clouddumps mount issue we saw before :/
[09:35:54] ouch, thx dcaro
[09:35:57] will take a look
[09:36:06] `Started 29 minutes ago Expired 8 minutes ago`
[09:36:12] volans: could it be that after reboot the cookbook immediately deletes the silence?
[09:36:14] for the alert, from karma at least
[09:36:35] it shouldn't say expired though no?
[09:36:38] if you delete it
[09:36:39] https://alerts.wikimedia.org/?q=team%3Dwmcs
[09:36:44] I think so yes
[09:36:45] you're right, so I don't know
[09:37:05] I'll dig when I can, thanks, sorry for the noise
[09:37:15] maybe 20min is the time it uses
[09:37:17] np
[09:37:27] np, thank you for looking, I remember we had some unexpected alerts in past reboots
[09:37:35] but I'm not sure it was the same cookbook
[09:38:39] for andrew, I did a few other cloudvirts, task updated, you can continue from there
[09:38:46] * volans has to go afk
[09:39:42] godog: how is it going, should I run the full functional test suite? (wmcs.toolforge.run_tests cookbook, you can also if you want, takes ~20min currently iirc)
[09:40:34] dcaro: so puppet on bastion is fixed, nfs workers all rebooted
[09:40:38] alerts went away
[09:40:49] going through the list of non nfs workers atm and rebooting what's stuck
[09:41:06] ack
[09:41:10] things are converging though at least
[09:41:33] I took notes as I went along, will update T404584 once done
[09:41:34] T404584: [tools,nfs,infra] Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584
[09:42:05] we can do toolsbeta again if you had to change any cookbook things
[09:42:16] though it would not need to change networks this time
[09:43:18] yeah good point, at least we would sidestep opentofu and test the rest, I'll think about it
[09:45:32] btw. let's get rid of that second volume https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82
[09:46:02] is that mounted anywhere?
[09:46:35] I don't think so
[09:46:38] checking
[09:46:48] unattached
[09:46:55] burn it :)
[09:47:11] +1
[09:47:28] I'm confused by the fact I see two volumes with the same name in https://openstack-browser.toolforge.org/project/toolsbeta
[09:47:35] I didn't think it was possible
[09:47:47] it should delete by ID anyway
[09:48:02] yeah names can be duplicated, ask me how I know
[09:48:12] :D
[09:50:14] I suspect that there might be a reason why it's not deleted though (so I'm slightly expecting openstack to fail to delete it)
[09:51:09] hmm... it seems puppet is failing to run on many toolsbeta vms
[09:53:49] godog: sorry, missed your ping, but seems like you sorted it already?
[09:53:56] on which hosts dcaro?
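The removal itself goes through the tofu-provisioning MR above; purely as a sketch, the manual double-check that the stray 4gb volume really is unattached (names can repeat, so always work by ID) would be something like:

    # confirm the volume is "available" with no attachments before removing it
    openstack volume show ab8ddafd-f8f9-41e3-8a79-d985e962a2ee -c status -c attachments
    # the actual delete happened via the opentofu MR rather than `openstack volume delete <id>`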
[09:54:22] taavi: no problem, yeah we were wondering why cloudcumin1001 can't talk to svc.toolforge.org anymore
[09:54:32] even though I seem to remember it could
[09:54:46] but yes, at any rate we should be going through the prod proxy anyways I think
[09:54:52] https://prometheus.wmcloud.org/graph?g0.expr=puppet_agent_failed%7Bproject%3D%22toolsbeta%22%7D%0A&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=
[09:55:00] I'm guessing the dumps thing also
[09:55:39] (you can get to that query kinda from the alert in karma)
[09:55:42] https://usercontent.irccloud-cdn.com/file/KjKiAzvk/image.png
[09:55:59] https://usercontent.irccloud-cdn.com/file/tCHblyLa/image.png
[09:56:19] godog: yeah it seems like the reboot_stuck_workers just does a plain requests.get() against the prometheus address instead of using the prometheus wrapper in spicerack, which IIRC does the right thing with the proxies
[09:56:25] yes my bad, I'm used to showing only active alerts in karma dcaro
[09:56:56] hmm... not sure why mail needs dumps nfs
[09:57:29] mail needs user/project nfs for processing .forward files, and the toggle for nfs mounts is all-or-nothing
[09:57:30] we might not be able to mount only homes I guess in puppet
[09:57:38] yep
[09:57:46] taavi: ack, thank you, that makes sense, I'll add a note to follow up on that too
[09:59:08] dcaro: looks to me like a (new?) side effect of hard rebooting vms
[09:59:22] yep, I had not seen that nfs mount issue before
[10:00:26] also yet unclear to me why only clouddumps1001 and not 1002
[10:00:38] but anyways, I'll expire the clients manually for now then take a break
[10:00:45] 👍
[10:00:49] I'll go for lunch soon too
[10:04:06] volume deleted correctly :)
[10:04:16] dcaro: nice, thanks
[10:04:43] you got lucky too with the toolsbeta pipeline failing and not tools
[10:59:21] ty
[11:56:13] re mounts being all-or-nothing: T405462 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191005
[11:56:13] T405462: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462
[12:03:38] interesting, it is not clear to me which problem that would solve though
[12:04:25] just not mounting the dumps nfs in vms that don't need it
[12:04:30] (like tools-mail)
[12:05:35] is that the problem? I'm not against it, I'd like to understand
[12:12:24] yep, that would be what that fixes
[12:13:28] I wouldn't call it broken, but cleanup and general problem avoidance
[12:13:38] an NFS mount cannot break if it does not exist
[12:13:52] agree
[12:20:07] fair enough, so which mounts will effectively appear is not only controlled by modules/cloudnfs/data/projects.yaml but also that variable?
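Going back to the puppet_agent_failed link above: the same expression can be queried from the command line to list the affected VMs. A sketch only; the /api/v1/query path and the instance label name are assumptions:

    curl -sG 'https://prometheus.wmcloud.org/api/v1/query' \
      --data-urlencode 'query=puppet_agent_failed{project="toolsbeta"} > 0' \
      | jq -r '.data.result[].metric.instance'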
[12:21:52] I guess I'm wary of complicating an already brittle/complicated situation (from my POV anyways) for relatively little gain
[12:22:04] if clouddumps is busted then we'll go around anyways and reboot VMs
[12:22:13] s/if/when/
[12:25:46] it seems though that every time you reboot a vm you have to unbust it from clouddumps manually as it is
[12:26:08] agree though that it's not a big gain
[12:27:30] yes, kicking clients from clouddumps1001 after hard reboot is still an open issue
[12:27:43] from clouddumps1001 only so far, not 1002, anyways
[12:27:54] yep, kinda weird
[12:28:17] I see obvious disadvantages vs dubious advantages with the patch, my two cents
[12:33:56] hmm, you have a very good point
[13:18:24] when pcc says "Hosts that were skipped (fail fast)" that usually means it doesn't have facts for the host?
[13:18:36] seeing that more and more lately, maybe because I'm reimaging lots of things
[13:18:59] do you have an example?
[13:19:07] sure, https://puppet-compiler.wmflabs.org/output/1191018/7536/
[13:19:20] most of the time it's a typo in the hostname but today I think it isn't. Hubris!
[13:19:53] "cloudcephosd1035.eqiad.wmnet cloudcephmon1004.eqiad.wmnet" is not an individual host
[13:20:04] your Hosts: commit trailer is missing a comma between those two
[13:20:32] heh, a different sort of typo
[13:20:35] thx
[13:25:30] works much better with the ,
[13:30:13] does it care about spaces? (I hope not)
[13:35:15] hmm... I'm finding some tools with weird memory limit/request values
[13:35:21] https://www.irccloud.com/pastebin/hC0lM7B3/
[13:35:31] only a few though
[13:35:58] is that something a user is setting with a decimal point?
[13:37:04] I suspect it might be from very old jobs
[13:38:15] tool-os-deprecation has that issue, but the crons are from august :/
[13:38:53] https://www.irccloud.com/pastebin/tiMU8Xkd/
[13:39:03] somehow the 2G got rounded up weird somewhere
[13:39:19] that reminded me to look at https://os-deprecation.toolforge.org/ and shudder at the deployment-prep list :(
[13:40:02] re-uploading the cron got it ok this time though
[13:40:11] might have been a jobs-api thing at some point
[13:49:54] seeing these failures in production puppet CI:
[13:49:56] 13:46:33 rspec './modules/profile/spec/classes/profile_cloudceph_client_rbd_backy_spec.rb[1:1:1:2]' # profile::cloudceph::client::rbd_backy on debian-11-x86_64 when no ceph repo passed uses correct default is expected to contain Apt::Package_from_component[ceph] with component => "thirdparty/ceph-quincy"
[13:50:58] andrewbogott: ^
[13:51:38] wow "when no ceph repo passed uses correct default is expected to contain " is hard to parse
[13:51:43] but I think I can fix it
[13:52:16] very possible the assumptions of the spec tests need to be adjusted
[13:53:25] I think the test name may be missing a verb
[13:53:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191027
[13:54:12] I wonder why my patch that introduced the problem passed?
[13:55:33] I think it's just missing a `the`, maybe `the correct default repo`
[13:56:31] cdanis: patch is merged, lmk if that doesn't fix things
[13:59:25] jenkins is happy now, thanks :)
[14:02:56] topranks: have time for the sync meeting?
[14:03:31] sorry yes, just lost track, joining now
[14:51:46] godog: I'd like to pool cloudcephosd1052 because it has slightly irregular hardware and I want to make sure it works. Is it ok if I reserve 1050 and 1051 for your work and pool 1052?
[14:51:58] andrewbogott: sure SGTM
[16:20:49] What's with the NFS alert?
[16:21:02] godog, that you?
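A sketch of how the odd memory limit values above could be listed for one tool's cronjobs (namespace name taken from the log; the jsonpath assumes the limits sit on the job's first container):

    kubectl -n tool-os-deprecation get cronjobs \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.jobTemplate.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}'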
[16:21:24] looks like a blip, the url works for me now
[16:21:27] ok
[16:21:32] and there's the recovery
[16:21:42] I'm pooling an OSD node...
[16:21:49] hmm, interesting
[16:21:55] but only a little bit, and it's been ongoing for >1 hour
[16:22:02] so I don't know why it would complain now
[16:22:12] then maybe not so much xd, might be the dell disks moving really old data?
[16:22:27] might also just be a timeout somewhere in between
[16:22:38] toolschecker is not within toolforge though right?
[16:22:39] lol, this dashboard is funny https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&from=now-6h&to=now&timezone=utc
[16:22:43] that one might be
[16:23:01] toolschecker is a separate vm in the tools project
[16:24:11] the host I'm pooling shows a disk saturation spike 5 minutes ago https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=cloudcephosd1052&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[16:24:12] iirc there was some part of it running as a cronjob or something?
[16:25:34] andrewbogott: maybe
[16:25:47] everything else looks quite ok though
[16:25:55] yeah.
[16:26:03] there were some lost pings around the network not too long ago, probably a switch getting a bit busy?
[16:28:35] nah, it's the reboot of cloudcephmon1005
[16:30:51] that certainly shouldn't have caused NFS issues
[16:31:07] I think? There was always a quorum
[16:32:16] agree
[16:35:08] looks like it might take ~24 hours for things to rebalance after adding this OSD, so it's safe to assume that rebalancing is ongoing during your day tomorrow if you see other blips.
[16:38:54] ack
[17:35:47] * dcaro off
[17:35:49] cya tomorrow
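For the rebalancing mentioned at the end, a couple of read-only commands (a sketch, assuming they are run on a ceph admin/mon node) are enough to watch the backfill triggered by pooling cloudcephosd1052:

    sudo ceph -s            # overall health plus recovery/backfill progress
    sudo ceph osd df tree   # per-OSD utilisation as data moves onto the new node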