[00:26:14] dduvall: did you give your auth token the "Unrestricted (dangerous)" permission? That is needed for some of how the magnum cluster is built.
[00:26:34] s/auth token/application credentials/
[00:27:21] very hidden at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum#Provisioning_with_OpenTofu
[07:22:15] greetings
[07:56:00] morning
[08:12:09] hello
[08:16:05] dcaro: re: T403684 was there a known/expected cause or a blip ?
[08:16:05] T403684: HighIOWaitStalling High iowait detected on clouddumps1002:9100. - https://phabricator.wikimedia.org/T403684
[08:16:33] Expected mostly, periodically dumps does the syncing and such, and that triggers this alert
[08:16:48] if it's not sustained it's ok
[08:17:14] objections if I look into the alert to make it less noise prone ?
[08:17:32] last action hero for my clinic duty week
[08:17:35] I think though that we should stop getting those alerts at all, I think now some other team manages dumps?
[08:17:56] sure :)
[08:18:09] iirc there was already a 'high iowait for long' or similar, but not sure
[08:18:58] ah yeah there is indeed sth like that, I came across it for the icinga migration
[08:19:11] re: owner I don't know tbh, puppet says wmcs is the role owner tho
[08:20:03] bbiab
[08:21:32] maybe dhinus remembers if the handover was completely done
[08:24:41] as for all questions regarding clouddumps, the answer is "nobody knows" :)
[08:25:22] data-platform-sre have fixed a few things on those hosts, but they're not the official "owners" as far as I know
[08:35:14] I think wmcs is still somewhat responsible as it's a "cloud*" host, puppet is listing WMCS as role_contacts but has 3 admin::groups (analytics-admins,dumps-roots and wmcs-roots)
[08:36:24] godog: +1 for making the alert less noisy
[08:53:44] so the original alert in icinga basically never fired a critical (threshold 10k iowait) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111340/2/modules/labstore/manifests/monitoring/interfaces.pp
[08:54:04] I'm +1 to just nuke the alert tbh
[09:07:57] I don't remember that alert ever being useful, so +1 for deleting
[09:09:16] LOL godog already suggested deleting it in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1111338
[09:11:23] lol
[09:11:47] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1184715
[09:16:01] xd
[09:25:59] FYI: the repo sync definition for thirdparty/helm3 as used by kubeadm-k8s-1-30 is broken/outdated/obsoleted: https://paste.debian.net/hidden/3fb15259/
[09:26:17] this breaks the import of the new Jenkins packages for the latest security release
[09:26:56] I'll temporarily drop the update definition to unlock, if kubeadm-k8s-1-30 is still in use, could you re-add it with a fixed config?
[09:27:04] s/unlock/unblock
[09:36:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184719
[09:38:13] yeah it is in use, looking. cc dcaro ^
[09:39:02] oh, I was able to import packages not long ago
[09:39:43] (I think)
[09:40:26] https://www.irccloud.com/pastebin/tfi8gvu0/
[09:40:40] puppet seems broken, looking
[09:41:30] Duplicate declaration: Class[Firewall] is already declared; cannot redeclare (file: /srv/puppet_code/environments/production/modules/profile/manifests/firewall.pp, line: 47
[09:41:41] the comment in the puppet code was still correct :/
[09:41:43] sigh, that's me
[09:41:53] what host was it dcaro ?
[09:42:11] dcaro: the helm repo has moved https://github.com/helm/helm/issues/31082#issuecomment-3247614600
[09:42:13] I checked cloudinfra-idp-1, but there's many (check the alerts)
[09:42:24] thank you, checking
[09:42:28] https://alerts.wikimedia.org/?q=team%3Dwmcs
[09:42:40] dhinus: thanks!
[09:49:12] do you know how profile::wmcs::firewall gets included? I can't find it in production puppet and I think it might be the culprit
[09:49:33] anyways I'll revert for now
[09:49:38] sorry for the noise
[09:50:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184724
[09:50:51] interesting I also cannot find an include
[09:50:58] by default VMs don't run a host-level firewall
[09:51:15] I've set up the jenkins apt source in a systemd nspawn container to fetch the new jenkins deb, then you can update the helm apt config w/o the revert above
[09:51:39] moritzm: ack thanks
[09:51:51] roles that do run that include profile::firewall, profile::wmcs::firewall is sometimes included in per-host hiera but is generally rarely used
[09:52:24] mmhh that might be it then taavi
[09:53:22] the ferm::* defines use virtual resources that make those resources only do anything if the ferm class is present
[09:53:34] the firewall things are a bit more complicated
[09:53:38] I forgot to check instance-puppet, but even there it's only in "cloudinfra/enc-"
[09:55:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184726 I think this should be the new repo (I'm always a bit unsure of the url I should use and such)
[09:56:19] so practically I think profile::wmcs::firewall is only used in 2 vms (enc-3.cloudinfra, enc-4.cloudinfra)
[09:57:28] heh I'm looking at instance-puppet.git too and I'm not getting how e.g. tools-proxy-10 gets profile::wmcs::firewall, assuming the include profile::firewall there is the culprit
[09:58:36] * taavi is not seeing the connection between the errors and `profile::wmcs::firewall`
[09:59:07] cloudinfra-idp-1 gets it from profile::firewall
[09:59:10] https://www.irccloud.com/pastebin/w1Wsc1L5/
[09:59:42] yeah maybe a red herring taavi
[10:00:18] as in it does not have the p:wmcs::firewall
[10:01:04] I don't get it atm, oh well
[10:01:10] revert is rolling out
[10:04:50] I think the cleanest fix would be to add a Hiera lookup in profile::wmcs::instance for profile::firewall::provider and only add the definitions if set to ferm or nftables
[10:07:09] mmhh provider defaults to ferm in cloud afaics
[10:08:37] anyways I'll look deeper into the firewall issue this afternoon
[10:15:18] moritzm: got the helm url update here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184726, feel free to merge that or your removal and we'll merge that later
[10:15:19] * dcaro lunch
[10:17:15] already +1d :-)
[10:18:54] the default for cloud is 'none', that should work AFAICT
[10:21:26] mmhh I see this in puppet.git though
[10:21:29] hieradata/cloud.yaml:profile::firewall::provider: ferm
[10:23:45] * godog lunch
[11:38:42] merged the helm repo change, it's failing the signature
[11:38:56] https://www.irccloud.com/pastebin/BUwbrDLP/
[11:41:23] I downloaded the signature and got the key id following https://wikitech.wikimedia.org/wiki/Reprepro#Adding_a_new_external_repository
[11:41:29] maybe the filename is wrong?
[11:46:37] hmm...
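For reference, a minimal way to inspect which key IDs the published repository key actually carries, assuming the buildkite gpgkey URL mentioned below (the helm.gpg filename is just a placeholder):

    # fetch the published key without importing it into any keyring
    curl -sL https://packages.buildkite.com/helm-linux/helm-debian/gpgkey -o helm.gpg
    # --show-keys inspects the file directly; --keyid-format long prints the
    # 16-hex-digit IDs for the primary key and for each subkey
    gpg --show-keys --keyid-format long helm.gpg

Per the discussion that follows, the VerifyRelease entry in the reprepro update definition ended up needing the signing subkey's ID rather than the primary key's.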
[11:46:46] tried downloading the key through gpg, it gets the same key
[11:47:02] https://www.irccloud.com/pastebin/2gZPnXye/
[11:47:46] puppet did pull it
[11:47:50] https://www.irccloud.com/pastebin/rHnyLaQ7/
[11:48:55] and it does find it in apt1002:
[11:48:58] https://www.irccloud.com/pastebin/mikGyOl9/
[11:49:35] moritzm: any ideas? ^
[11:50:18] interesting though, that the key was downloaded from `https://packages.buildkite.com/helm-linux/helm-debian/gpgkey`, not packagecloud.io
[12:00:14] it sounds like their instructions are off and they documented a different key than what they used to sign the repo with?
[12:01:07] might be
[12:01:23] which key is it expecting?
[12:01:37] maybe the old one?
[12:04:59] i think you need the key id of the subkey, and not the root key
[12:13:34] ack, that'd be 4B196BE9C4313D06
[12:13:37] sending patch
[12:14:52] if it works I'll add a note to the wiki
[12:15:43] this should be it I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184754
[12:23:05] that was it! \o/
[12:23:08] https://www.irccloud.com/pastebin/r6MAq5TC/
[12:27:32] done https://wikitech.wikimedia.org/w/index.php?diff=2339271 there might be a nicer way to check though, feel free to update/edit
[12:35:40] "Then get the key ID:" < I think you have to do it before running the reprepro command?
[12:37:05] oh yes, that was already in the wrong order xd
[12:38:04] swapped it now :), thanks!
[13:08:19] @dcaro i am about to swap that Nic on cloudcephosd1052 that you mention has not been deployed. do you need to silence or do anything before i take it down?
[13:09:20] yes I think, let me do so
[13:18:24] Thanks, let me know when I can shut it down
[13:22:17] sorry, one sec xd
[13:25:20] jclark-ctr: done 👍
[13:43:25] finished physical, just waiting for provision to finish so it will set ports in bios correctly for the new card
[13:57:17] * andrewbogott is back, catching up on email before the meetings
[14:00:42] oh no, okta
[14:02:53] komla, weekly meeting?
[14:13:27] @dcaro finished. Thanks for your help today
[14:53:59] yw!
[14:59:37] bd808: that worked! thank you
[15:02:29] now the creation is seemingly hanging. i'm wondering if it's because the floating ip quota has been reached in testlabs
[15:02:45] dduvall: indeed, the failure case for running out of floating IPs is very hangy
[15:02:59] dduvall: awesome. I have been meaning to work on some docs for magnum things. https://wikitech.wikimedia.org/wiki/User:BryanDavis/OpenStack#Magnum is my tiny start on that.
[15:03:03] let me see if I can free one up for you... you'll need to start your deploy over though
[15:03:22] andrewbogott: awesome, ty
[15:03:37] it might need two, i'm not sure
[15:04:50] well, actually, I'm confusing the two different magnum deploys; I'm not sure eqiad1 magnum deploys need floating IPs at all. But I'm also trying to attend the monthly meeting so may be talking nonsense
[15:05:26] Anyone mind if I delete networktests-* in the 'testlabs' project? As far as I know those were arturo's testing hosts
[15:05:30] taavi: ^ opinion?
[15:05:33] i have ` floating_ip_enabled = true` so i assume it needs one
[15:05:51] I have just turned off floating ips and load balancers in the clusters I have built so far
[15:06:07] oh, ok. My advice is to not do that, you can tack a proxy or load balancer in front after the fact
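As a rough illustration of the no-floating-IP route described above, here is what the equivalent looks like with the plain openstack CLI rather than OpenTofu; all names and flavors are placeholders, not the actual gitlab-cloud-runner or testlabs values:

    openstack coe cluster template create k8s-no-fip \
      --coe kubernetes \
      --image <fedora-coreos-image> \
      --external-network <external-network> \
      --master-flavor <master-flavor> \
      --flavor <worker-flavor> \
      --floating-ip-disabled

With floating IPs turned off, reaching the API server then goes through a web proxy or a tunnel into the project, as the following messages discuss.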
[15:06:13] yeah, i was trying to avoid the setup of haproxy and stuff (for testing)
[15:06:30] for the zuul clusters I added my own haproxy gateway using ipv6
[15:06:59] can you not just use a regular cloud-vps proxy in front of your cluster?
[15:07:08] for testing you can also just tunnel into the project and then talk to the k8s nodes from the internal network
[15:07:42] or that ^, a webproxy pointed at port 6443 on the master
[15:07:56] a regular vps proxy wouldn't be too bad. anything i can easily provision from tofu
[15:07:59] that does automatic ssl
[15:08:22] * bd808 looks for the example when he had tofu doing a webproxy
[15:08:48] i'm trying to port gitlab-cloud-runner to use wmcs with minimal divergence from the digitalocean stuff
[15:09:01] dduvall: I'm going to hold off freeing floating IPs until I hear back from Taavi about those test VMs but I can bump up the quota if it turns out you need it
[15:09:02] almost there it seems
[15:09:09] andrewbogott: no problem
[15:09:20] dduvall: https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/commit/a598449dd10ac26e76584eb68baa5615711b9fa2
[15:09:32] i guess i can attend the staff meeting instead of hacking :)
[15:09:43] bd808: ah, thanks!
[15:10:19] Also (tangent maybe?) codfw1dev magnum is running a new driver that has pretty good load-balancer integration. So I'm in the market for someone who wants to test deploys there; everything running there will be running in eqiad1 in a month or two.
[15:16:41] jclark-ctr: I can't reach cloudcephosd1052.eqiad.wmnet, did it come up ok?
[15:17:03] it was up
[15:17:07] will look
[15:24:05] @dcaro so i did only look at mgmt. I do not have root access so it might need the /etc/network/interfaces file updated. otherwise, 2nd option, i just reimage it; provision only took care of bios settings
[15:24:38] hmm, was it tricky to provision? if so the interfaces file sounds tempting xd
[15:24:46] andrewbogott: ^ do you have any preference?
[15:25:01] provision was just setting correct port to pxe http for uefi
[15:25:38] reimaging w/bullseye is a bit of a pain but with bookworm it should just work (and then we can leave it empty until we get our cluster to bookworm)
[15:25:53] sounds like a plan :)
[15:27:28] andrewbogott: there's a bunch of testing things that rely on those VMs, so you can either delete all of that, or none of it, but please do not half-delete things and leave the other stuff in a broken state
[15:28:04] taavi: that's fine, if they're used for ongoing testing then I'll leave them
[15:28:13] thx for the response
[16:29:01] * dhinus off
[18:21:11] andrewbogott: the cluster creation (id `34802dd2-10d8-4bd7-a1f6-99dd5ada9094`) still seems to hang. or rather, i don't know what it is doing but it's been about 15 min
[18:21:39] I'll take a look
[18:21:53] seems like it has created the instance for the master but nothing else yet
[18:21:56] ty!
[18:22:53] also, i'm a total openstack cli noob so let me know what i can do to get more debugging info
[18:23:47] debugging will involve digging in Heat which I can talk you through but the Heat driver is on the way out so anything you learn won't help for long
[18:24:23] the basic Heat object is called a 'stack'. So you can see individual components with 'openstack stack list' and 'openstack stack show'
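A sketch of how that drilling-down looks in practice, assuming python3-heatclient is installed (as noted just below) and using placeholder names and IDs:

    # the cluster record carries a stack_id pointing at the top-level Heat stack
    openstack coe cluster show <cluster-name> -c stack_id
    # list the top-level resources and their statuses
    openstack stack resource list <stack-id>
    # -n/--nested-depth recurses into the nested stacks-in-stacks
    openstack stack resource list -n 3 <stack-id>
    # show the status_reason for a single suspect resource
    openstack stack resource show <stack-id> <resource-name>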
[18:25:31] so for instance I'm doing 'openstack stack resource list 9189a84e-0bb2-43d3-8baa-080c9f2ba5ed' which shows all the bits of the deployment and which are done and which aren't
[18:25:52] i don't seem to have the `stack` subcommand
[18:26:47] ah ha! `apt install python3-heatclient` gives it to me
[18:28:02] you just want 'openstack stack'
[18:28:11] oh, I see. yeah, that'll do it :)
[18:30:17] looks like it's stuck deploying k8s to the node...
[18:30:36] huh, ok
[18:30:40] where do you see that?
[18:31:24] way down in the stack, with openstack stack resource list 5e96ec98-23a3-4d96-833d-e51d79071946
[18:31:30] It's stacks in stacks in stacks
[18:31:56] oh i see!
[18:32:05] stacks on stacks
[18:32:08] yeah
[18:32:33] but now I'm trying to figure out how to know what it's actually doing (without logging into that VM which may or may not be possible)
[18:32:50] the best solution, abstract all of it
[18:34:26] https://logstash.wikimedia.org/app/dashboards#/view/3ef008b0-c871-11eb-ad54-8bb5fcb640c0?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-24h%2Cto%3Anow))
[18:34:36] dduvall: did that link survive the trip? That's the other thing I'm looking at
[18:35:06] although tbh the logs seem like they were happy until 15 minutes ago and then just stopped
[18:35:48] yeah, that's odd. i just started trying to provision this cluster this morning (well, a bit yesterday as well)
[18:36:52] i see those rates over the past 7 days as well, so i'm guessing they're not related to (or caused by) what i'm trying
[18:37:51] ummm that sounds like a 'no, the link didn't survive the trip'
[18:38:13] because I'm looking at logs filtered down to just the creation of that one cluster
[18:38:20] ah heheh
[18:38:34] <3 those opensearch links
[18:42:16] So options here are probably 1) wait for a timeout to show us an error message or 2) delete and try again
[18:42:39] or 3) let me set you up on the test cluster so you can try with the upcoming magnum driver (presuming that this is a proof-of-concept that you don't need live for a while)
[18:43:57] right on. i'll go with option 1 for now and then perhaps option 3 if i keep running into problems
[18:44:02] thanks for the help!
[18:44:49] sorry that it's flaky. It's a very fragile operation so hard to know if something is bad in your config or if some random doohicky just timed out at the wrong time
[18:46:12] hey no problem
[18:56:06] neat, i was able to log in to the master node with the keypair i used and `core@{instance hostname}`
[18:58:18] looks like the kubelet is just barfing over and over
[19:05:42] oh you set up a keypair, nice :)
[19:06:00] can you tell why it's failing?
[19:06:14] actually, i'm not sure kubelet is failing
[19:06:21] the api server seems to function
[19:06:37] i can do `kubectl cluster-info dump` for example
[19:08:32] the deployment has finally declared itself a failure, says
[19:08:35] https://www.irccloud.com/pastebin/JbgJMS66/
[19:08:39] not super helpful!
[19:14:55] :(
[19:15:07] i guess acceptance is good
[19:15:19] i'll kick it off again and see if i can debug further
[19:16:18] a second try might just work :/
[19:36:49] dduvall: looks like it's stuck on that same step. Can you tell what it's doing if you ssh in?
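A hedged sketch of where to look once logged in, assuming the Fedora CoreOS image Magnum deploys here; the unit name is an assumption and may differ between image versions:

    ssh core@<master-node-address>
    # the agent that fetches and runs the Heat software-deployment scripts
    sudo journalctl -u heat-container-agent -f
    # the deployment scripts themselves, one per software config UUID
    ls /var/lib/heat-config/heat-config-script/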
[19:38:26] my best guess atm is that `/var/lib/heat-config/heat-config-script/8d94e769-9fe7-46c8-8285-99b2a1d35407` is stuck trying to patch the master node
[19:38:45] https://www.irccloud.com/pastebin/9HLuyyzF/
[19:39:32] that seems easy
[19:39:45] haha yeah
[19:41:30] wish i had strace but i don't even know how to install things on Fedora CoreOS
[19:45:27] i also see an admission webhook failure
[20:15:35] andrewbogott: looking at that logstash link you dropped for d.duvall, I'm seeing lots and lots of auth failures for R.ook's disabled account -- https://logstash.wikimedia.org/goto/0d3cd4f1c365fb0b9ed7f3a85d1329b9 -- Lookup on the request id for related logs makes me suspect the requests are coming from the superset-127-jxhvhh7bzlrl-node-0.superset.eqiad1.wikimedia.cloud instance.
[20:16:06] not sure if this is a known thing, but it seems weird
[20:18:38] yeah, it's harmless but I should do another round of log cleanup soon