[09:38:10] anyone interested in reviewing the patches from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155602?
[10:31:26] tools-prometheus-8 seems to be having issues (I can get in the console, but it's really slow), I'll reboot
[10:37:37] hmm.... something might have gone wrong with the network?
[10:37:42] https://www.irccloud.com/pastebin/ciPMCpqb/
[10:39:11] it's up and running now though
[10:40:27] the load on that host spiked before it went unresponsive https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-3h&to=now&timezone=browser&var-project=tools&var-instance=tools-prometheus-8
[10:40:31] i doubt that's a network issue
[10:46:06] it might be caused by it not being able to write to disk or similar
[10:46:15] (io contention)
[11:30:26] I'm awake early and I've got questions! dcaro, I'm trying to get the networking setup right on these two tickets T378828 and T394333 but I'm confused about cloud-storage vs cloud-private networks.
[11:30:27] T378828: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828
[11:30:27] T394333: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333
[11:31:08] Specifically, I can't find any hosts in netbox that look like they're connected to cloud-storage. I'm thinking either there was a naming change that I missed or else I'm confusing physical networks with virtual networks or something like that. Can you help me understand?
[11:31:46] I'll be there in a bit (lunch), we can give it a look
[11:32:02] ok! No rush, none of this work is getting done today.
[11:32:12] * andrewbogott should eat too
[11:33:15] andrewbogott: cloud-storage IPs are unfortunately not assigned via Netbox, which is why it looks a bit confusing there
[11:33:38] in particular, the main place where that network appears is the switch port config, for example https://netbox.wikimedia.org/ipam/vlans/142/interfaces/
[11:34:36] Yep, that makes sense (and is familiar). I guess I'm confused about what that looks like to dc-ops though.
[11:41:10] And, oddly unrelated, I'm trying to reimage cloudcephosd1014 (which has been in service for ages) and the dhcp stage is failing in the debian installer. Maybe that's a topranks question?
[11:50:52] taavi, I saved a fun puzzle for you in case you don't have enough mystery in your life: On eqiad VMs, k3s always works fine. On codfw1dev VMs, it never works and complains about overlayfs. I saved two (to me, identical) VMs, k3stest2.testlabs.codfw1dev.wikimedia.cloud and k3stest1.testlabs.eqiad1.wikimedia.cloud for you to explore.
[11:51:11] To see it happen, "curl -sfL https://get.k3s.io | sh -" and then "journalctl -fu k3s"
[11:51:40] (That is not even all of my questions from last night yet)
[11:53:03] andrewbogott: on the codfw1dev one, `grep -Hrn overlay /etc` reveals there is a `/etc/modprobe.d/blacklist-wmf_overlay.conf` file blocking the required kernel modules from loading
[11:55:48] whaaaat? Where did /that/ come from?
[11:56:06] it has a `# This file is managed by Puppet` comment at the top
[11:56:14] anyway, if you have more mysteries I can look after lunch :D
[11:57:23] See, this is why I saved this for you rather than staying up all night trying to figure it out myself. Have a good lunch!
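A minimal sketch of verifying that the blacklist file found above is what breaks k3s on the codfw1dev VM; the file path and the k3s commands come from the log, while the file contents and the modprobe workaround are assumptions:

# Inspect the Puppet-managed blacklist (path from the `grep -Hrn overlay /etc` output above)
cat /etc/modprobe.d/blacklist-wmf_overlay.conf

# Check whether the overlay module is currently loaded
lsmod | grep overlay

# A plain "blacklist overlay" line only blocks automatic/alias loading, so an explicit
# modprobe should still work; it will fail if the file uses an
# "install overlay /bin/false" style directive instead
sudo modprobe overlay

# Re-run the failing unit and watch the logs (commands from the log above)
sudo systemctl restart k3s
journalctl -fu k3s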
[11:58:29] andrewbogott: failing dhcp inside the debian installer is normally a problem with the firmware of the NIC
[11:58:47] systems dhcp twice during reimage, once at the BIOS/PXEboot stage, and again in the installer
[11:59:02] Yeah, and clearly the first part is working
[11:59:04] if it gets to the installer it means the setup/netbox/switch bit is ok for dhcp
[11:59:07] yeah
[11:59:08] ok, I'll do a fw upgrade
[11:59:29] but what we have seen is driver init fails in debian installer, it's a bug in some of the firmware versions
[12:02:12] topranks: while I'm bothering you... I think there's a blocker on T393614 about switching the port for another server? I can't find the subtask but iirc it was assigned to you, do you see it?
[12:02:13] T393614: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614
[12:02:16] * andrewbogott still searching
[12:05:00] ah, I think it's this https://phabricator.wikimedia.org/T396363
[12:08:49] andrewbogott: ok thanks for the heads up I'd missed that
[12:09:06] the port move is disruptive so I guess we need to depool the host before moving the link?
[12:09:33] Yeah. I think that host is what we're refreshing so probably I can drain it and leave it drained.
[12:09:35] I'll start that in a bit
[12:09:43] we can probably do the work in about 5 mins so alternately if things can weather one host link down for that long it's an option
[12:12:08] I'm telling it to drain, we'll see how that goes
[12:24:39] * taavi back
[12:25:37] seems like CI is very slow or stopped but for overlayfs... https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156321
[12:26:55] ah, there we go
[12:28:22] andrewbogott: did you have any more mysteries for me?
[12:29:14] there's the one about attaching a second IP to a VM but I don't have the test case set up yet, will do that now. Meanwhile... +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156321 ?
[12:29:33] +1
[12:34:55] thanks.
[12:35:18] OK, capi-4.octavia.codfw1dev.wikimedia.cloud has 172.16.131.86 attached and I want to be able to do this:
[12:35:29] https://www.irccloud.com/pastebin/ksvBBeCQ/
[12:35:50] (I don't think the pubkey is actually installed there but that doesn't matter for current purposes)
[12:38:09] ok, so the second interface is there but there's nothing setting an address on it https://phabricator.wikimedia.org/P77832
[12:38:31] possibly stupid question: why does this need to happen over that secondary interface?
[12:39:55] taavi: do you mean, why not just have one IP, only on the octavia-mgmt network?
[12:40:29] more like "why does this need a leg in the octavia-mgmt network?"
[12:41:01] oh! That's because it's going to be a worker node for magnum, so the magnum services on cloudcontrols need to talk to k8s on it.
[12:41:09] Could do that via proxies and such instead if need be.
[12:41:29] I was just thinking, since we already have the octavia-mgmt network and octavia is using it to talk to VMs, etc. etc.
[12:41:34] and why does that need to happen via the octavia-mgmt network and not the "normal" network?
[12:42:00] the normal network isn't visible to cloudcontrols
[12:42:05] but it could be proxied somehow.
[12:42:22] * andrewbogott ready to be surprised when taavi says 'yes it is'
[12:42:40] cloudcontrols have exactly the same connectivity to both of those networks
[12:42:58] so maybe that means that we don't need the octavia-mgmt network for octavia either?
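For reference, a minimal sketch of what manually setting an address on that second interface could look like while debugging (the discussion below moves toward using the primary network instead). Only the VM name and 172.16.131.86 come from the log; the interface name eth1 and the /24 prefix are assumptions:

# On capi-4.octavia.codfw1dev.wikimedia.cloud
ip -br addr show                               # confirm the second NIC is up but has no IPv4 address
sudo ip link set eth1 up                       # "eth1" is an assumed interface name
sudo ip addr add 172.16.131.86/24 dev eth1     # /24 is an assumed prefix length
# or, if the octavia-mgmt subnet hands out DHCP leases, request one instead:
sudo dhclient -v eth1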
[12:43:21] taavi@cloudcontrol2010-dev ~ $ ssh 172.16.129.116
[12:43:22] The authenticity of host '172.16.129.116 (172.16.129.116)' can't be established.
[12:43:37] * andrewbogott scowls
[12:43:39] if octavia can use the regular network, sure
[12:44:30] somehow i had the impression that that network existed because octavia wanted to be separate from everything else, not because it needed direct connectivity
[12:44:35] have cloudcontrols always been able to route directly to the cloud network or did that change when the cloud-private things were introduced?
[12:45:26] always but with a very giant asterisk, cloud-private made that not a very massive hack
[12:45:39] ok -- that's likely part of why I'm confused.
[12:46:00] beforehand it was very heavily firewalled and discouraged
[12:46:01] So I expect you're right and I'm making this more complicated than I need to. I'll see if I can get capi working over the primary network for now.
[12:46:31] And will leave octavia as it is since it's already the way the docs and devs want it.
[12:46:51] I think that's all my mysteries for now. Thank you!
[12:49:14] oh, nope, I already have a new one (although not especially a taavi question), how do I upgrade firmware on a host that's not in puppetdb? Spicerack tries to do a wildcard match and then says 'spicerack.remote.RemoteError: No hosts provided'
[12:50:03] oh, probably --new
[12:50:14] yep
[12:51:28] grrr now
[12:51:29] cloudcephosd1014.eqiad.wmnet: skipping DellDriverCategory.NETWORK has no member
[14:12:56] can I please get a +1 on this MR? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247/diffs
[14:15:18] i generally don't see a point of getting a proper review for those MRs, the project creation was already approved on the task and the patch was created by the script
[14:16:15] but done
[14:19:04] fair point, but for simplicity I like keeping the simple rule "all MRs get a review before merge"
[14:19:31] (also I could easily mistype the project description or CPU amount, etc.)
[14:20:30] side note: I pressed some keys while the cookbook was waiting at "Enter go when the patch is merged", and it aborted
[14:20:46] it's not idempotent so I have to manually complete the remaining steps
[14:23:15] ohh there is --skip-mr, nice
[14:24:47] in general the current system where it's half cookbook and half tofu seems really confusing in my mind
[14:30:30] I had a long discussion about it with dcaro where I was initially against having the cookbook at all, but he convinced me there is value in it. :) I'm not happy with the current workflow though, maybe we should start from T385604
[14:30:30] T385604: Decision Request - How openstack projects relate to tofu-infra - https://phabricator.wikimedia.org/T385604
[14:33:56] I'm pretty busy lately, so I might not be able to push that task for some time (until beta I guess), if you think it needs more priority let me know and I can try to shuffle things
[14:39:11] I can definitely wait a bit longer :)
[14:40:23] the workflow worked mostly fine today, there is no big blocker that needs to be resolved immediately
[14:47:05] hmm... prometheus tools is down?
[14:47:21] prometheus.svc.toolforge.org/tools gives me 503 (same from grafana)
[14:47:22] looking
[14:49:26] dcaro: I just drained and reimaged cloudcephosd1014 and it came up with all the osd lvm partitions gone. Is that expected or did I do something wrong?
I was thinking that I would just use the 'undrain' cookbook after upgrade but now I'm thinking I need to do wmcs.ceph.osd.bootstrap_and_add instead of just repooling like I expected
[14:49:38] (in which case I should have destroyed it before rather than just draining I guess)
[14:50:19] andrewbogott: I think reimage should not destroy existing partitions, but I might be wrong
[14:50:21] andrewbogott: yep, currently the flow is destroy->reimage->bootstrap
[14:50:40] dhinus: yeah, shouldn't, but does
[14:50:49] ok, I will see if I can destroy it now that it's post-reimage...
[14:51:05] we might be able to tweak the scripts that delete the drives to not do it, but never got around to it
[14:51:07] I want to refine this workflow but will wait until we can do it in codfw1dev
[14:51:26] I'm sure that when I reimaged clouddb hosts, the partitions were still there. but ceph hosts have different partman recipes
[14:51:42] yep
[16:10:34] topranks: you were right about the firmware on cloudcephosd1014. Unfortunately the very next server cloudcephosd1015 won't even pxe boot. I've already done firmware upgrades on it, no change.
[16:15:31] Is the fact that I can do `become toolname echo foo` expected behaviour? (Just trying to figure out why it doesn't work in lima-kilo)
[16:16:08] Looks like they are using different scripts
[16:17:56] addshore: yep, lima-kilo has a silly wrapper only
[16:18:21] it's expected behavior, I think we even have some steps that do so in the wiki howtos and such
[16:18:55] okay great, I might look at tweaking the limakilo wrapper then :)
[16:19:16] 👍
[16:21:35] if you are curious, the prod script is installed with the package https://gerrit.wikimedia.org/r/plugins/gitiles/labs/toollabs/+/refs/heads/master
[16:23:52] ...I don't see any of the dhcp requests I expect but I might be looking in the wrong place (tcpdump -vvv 'udp and (src port 67 or src port 68 or src port 69)' | grep cloudceph)
[16:25:04] addshore: hmm, just manually install that package (sudo apt install misctools), and it seems the become from there does kinda work already :), with `sudo become tf-test echo foo`
[16:25:09] maybe we can use that package instead
[16:25:48] (no rush at all, can be done later, changing the current script is also ok)
[16:26:32] andrewbogott: if you don't grep do you see other requests? (the filter looks ok on a first look)
[16:26:47] yeah, I see lots of other traffic w/out the grep
[16:27:03] I guess I need to stop assuming this ever worked and check the bios settings
[17:17:22] * dcaro off
[17:17:24] cya on monday!
[22:31:02] Seeking reviews on https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87
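To close out, a rough sketch of the cloudcephosd refresh flow discussed above (destroy -> reimage -> bootstrap), since the ceph partman recipe wipes the OSD LVM partitions during reimage anyway. Only wmcs.ceph.osd.bootstrap_and_add is named in the log; the other cookbook names, arguments, and OS version are assumptions and should be checked against the wmcs-cookbooks repo before running anything:

# 1. Drain and then destroy the host's OSDs before reimaging
#    (assumed cookbook name and arguments)
cookbook wmcs.ceph.osd.drain_node cloudcephosd1014

# 2. Reimage with the standard reimage cookbook ("bookworm" is an assumed OS)
cookbook sre.hosts.reimage --os bookworm cloudcephosd1014

# 3. Re-create the OSDs and pool the host back in, instead of a plain undrain
#    (cookbook name from the log above; arguments are assumptions)
cookbook wmcs.ceph.osd.bootstrap_and_add cloudcephosd1014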