[09:38:10] anyone interested in reviewing the patches from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155602?
[10:31:26] tools-prometheus-8 seems to be having issues (I can get in the console, but it's really slow), I'll reboot
[10:37:37] hmm.... something might have gone wrong with the network?
[10:37:42] https://www.irccloud.com/pastebin/ciPMCpqb/
[10:39:11] it's up and running now though
[10:40:27] the load on that host spiked before it went unresponsive https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-3h&to=now&timezone=browser&var-project=tools&var-instance=tools-prometheus-8
[10:40:31] i doubt that's a network issue
[10:46:06] it might be caused by it not being able to write to disk or similar
[10:46:15] (io contention)
[11:30:26] I'm awake early and I've got questions! dcaro, I'm trying to get the networking setup right on these two tickets T378828 and T394333 but I'm confused about cloud-storage vs cloud-private networks.
[11:30:27] T378828: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828
[11:30:27] T394333: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333
[11:31:08] Specifically, I can't find any hosts in netbox that look like they're connected to cloud-storage. I'm thinking either there was a naming change that I missed or else I'm confusing physical networks with virtual networks or something like that. Can you help me understand?
[11:31:46] I'll be there in a bit (lunch), we can give it a look
[11:32:02] ok! No rush, none of this work is getting done today.
[11:32:12] * andrewbogott should eat too
[11:33:15] andrewbogott: cloud-storage IPs are unfortunately not assigned via Netbox, which is why it looks a bit confusing there
[11:33:38] in particular, the main place where that network appears is the switch port config, for example https://netbox.wikimedia.org/ipam/vlans/142/interfaces/
[11:34:36] Yep, that makes sense (and is familiar). I guess I'm confused about what that looks like to dc-ops though.
[11:41:10] And, oddly unrelated, I'm trying to reimage cloudcephosd1014 (which has been in service for ages) and the dhcp stage is failing in the debian installer. Maybe that's a topranks question?
[11:50:52] taavi, I saved a fun puzzle for you in case you don't have enough mystery in your life: On eqiad VMs, k3s always works fine. On codfw1dev VMs, it never works and complains about overlayfs. I saved two (to me, identical) VMs, k3stest2.testlabs.codfw1dev.wikimedia.cloud and k3stest1.testlabs.eqiad1.wikimedia.cloud for you to explore.
[11:51:11] To see it happen, "curl -sfL https://get.k3s.io | sh -" and then "journalctl -fu k3s"
[11:51:40] (That is not even all of my questions from last night yet)
[11:53:03] andrewbogott: on the codfw1dev one, `grep -Hrn overlay /etc` reveals there is a `/etc/modprobe.d/blacklist-wmf_overlay.conf` file blocking the required kernel modules from loading
[11:55:48] whaaaat? Where did /that/ come from?
[11:56:06] it has a `# This file is managed by Puppet` comment at the top
[11:56:14] anyway, if you have more mysteries I can look after lunch :D
[11:57:23] See, this is why I saved this for you rather than staying up all night trying to figure it out myself. Have a good lunch!
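A minimal sketch of verifying that the blacklist file found above is what breaks k3s on the codfw1dev VM; the file path and the k3s commands come from the log, while the file contents and the modprobe workaround are assumptions:

# Inspect the Puppet-managed blacklist (path from the `grep -Hrn overlay /etc` output above)
cat /etc/modprobe.d/blacklist-wmf_overlay.conf

# Check whether the overlay module is currently loaded
lsmod | grep overlay

# A plain "blacklist overlay" line only blocks automatic/alias loading, so an explicit
# modprobe should still work; it will fail if the file uses an
# "install overlay /bin/false" style directive instead
sudo modprobe overlay

# Re-run the failing unit and watch the logs (commands from the log above)
sudo systemctl restart k3s
journalctl -fu k3s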
[11:58:29] andrewbogott: failing dhcp inside the debian installer is normally a problem with the firmware of the NIC
[11:58:47] systems dhcp twice during reimage, once at the BIOS/PXEboot stage, and again in the installer
[11:59:02] Yeah, and clearly the first part is working
[11:59:04] if it gets to the installer it means the setup/netbox/switch bit is ok for dhcp
[11:59:07] yeah
[11:59:08] ok, I'll do a fw upgrade
[11:59:29] but what we have seen is driver init fails in debian installer, it's a bug in some of the firmware versions
[12:02:12] topranks: while I'm bothering you... I think there's a blocker on T393614 about switching the port for another server? I can't find the subtask but iirc it was assigned to you, do you see it?
[12:02:13] T393614: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614
[12:02:16] * andrewbogott still searching
[12:05:00] ah, I think it's this https://phabricator.wikimedia.org/T396363
[12:08:49] andrewbogott: ok thanks for the heads up I'd missed that
[12:09:06] the port move is disruptive so I guess we need to depool the host before moving the link?
[12:09:33] Yeah. I think that host is what we're refreshing so probably I can drain it and leave it drained.
[12:09:35] I'll start that in a bit
[12:09:43] we can probably do the work in about 5 mins so alternately if things can weather one host link down for that long it's an option
[12:12:08] I'm telling it to drain, we'll see how that goes
[12:24:39] * taavi back
[12:25:37] seems like CI is very slow or stopped but for overlayfs... https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156321
[12:26:55] ah, there we go
[12:28:22] andrewbogott: did you have any more mysteries for me?
[12:29:14] there's the one about attaching a second IP to a VM but I don't have the test case set up yet, will do that now. Meanwhile... +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156321 ?
[12:29:33] +1
[12:34:55] thanks.
[12:35:18] OK, capi-4.octavia.codfw1dev.wikimedia.cloud has 172.16.131.86 attached and I want to be able to do this:
[12:35:29] https://www.irccloud.com/pastebin/ksvBBeCQ/
[12:35:50] (I don't think the pubkey is actually installed there but that doesn't matter for current purposes)
[12:38:09] ok, so the second interface is there but there's nothing setting an address on it https://phabricator.wikimedia.org/P77832
[12:38:31] possibly stupid question: why does this need to happen over that secondary interface?
[12:39:55] taavi: do you mean, why not just have one IP, only on the octavia-mgmt network?
[12:40:29] more like "why does this need a leg in the octavia-mgmt network?"
[12:41:01] oh! That's because it's going to be a worker node for magnum, so the magnum services on cloudcontrols need to talk to k8s on it.
[12:41:09] Could do that via proxies and such instead if need be.
[12:41:29] I was just thinking, since we already have the octavia-mgmt network and octavia is using it to talk to VMs, etc. etc.
[12:41:34] and why does that need to happen via the octavia-mgmt network and not the "normal" network?
[12:42:00] the normal network isn't visible to cloudcontrols
[12:42:05] but it could be proxied somehow.
[12:42:22] * andrewbogott ready to be surprised when taavi says 'yes it is'
[12:42:40] cloudcontrols have exactly the same connectivity to both of those networks
[12:42:58] so maybe that means that we don't need the octavia-mgmt network for octavia either?
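For reference, a minimal sketch of what manually setting an address on that second interface could look like while debugging (the discussion below moves toward using the primary network instead). Only the VM name and 172.16.131.86 come from the log; the interface name eth1 and the /24 prefix are assumptions:

# On capi-4.octavia.codfw1dev.wikimedia.cloud
ip -br addr show                               # confirm the second NIC is up but has no IPv4 address
sudo ip link set eth1 up                       # "eth1" is an assumed interface name
sudo ip addr add 172.16.131.86/24 dev eth1     # /24 is an assumed prefix length
# or, if the octavia-mgmt subnet hands out DHCP leases, request one instead:
sudo dhclient -v eth1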
[12:43:21] taavi@cloudcontrol2010-dev ~ $ ssh 172.16.129.116
[12:43:22] The authenticity of host '172.16.129.116 (172.16.129.116)' can't be established.
[12:43:37] * andrewbogott scowls
[12:43:39] if octavia can use the regular network, sure
[12:44:30] somehow i had the impression that that network existed because octavia wanted to be separate from everything else, not because it needed direct connectivity
[12:44:35] have cloudcontrols always been able to route directly to the cloud network or did that change when the cloud-private things were introduced?
[12:45:26] always but with a very giant asterisk, cloud-private made that not a very massive hack
[12:45:39] ok -- that's likely part of why I'm confused.
[12:46:00] beforehand it was very heavily firewalled and discouraged
[12:46:01] So I expect you're right and I'm making this more complicated than I need to. I'll see if I can get capi working over the primary network for now.
[12:46:31] And will leave octavia as it is since it's already the way the docs and devs want it.
[12:46:51] I think that's all my mysteries for now. Thank you!
[12:49:14] oh, nope, I already have a new one (although not especially a taavi question), how do I upgrade firmware on a host that's not in puppetdb? Spicerack tries to do a wildcard match and then says 'spicerack.remote.RemoteError: No hosts provided'
[12:50:03] oh, probably --new
[12:50:14] yep
[12:51:28] grrr now
[12:51:29] cloudcephosd1014.eqiad.wmnet: skipping DellDriverCategory.NETWORK has no member
[14:12:56] can I please get a +1 on this MR? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247/diffs
[14:15:18] i generally don't see a point of getting a proper review for those MRs, the project creation was already approved on the task and the patch was created by the script
[14:16:15] but done
[14:19:04] fair point, but for simplicity I like keeping the simple rule "all MRs get a review before merge"
[14:19:31] (also I could easily mistype the project description or CPU amount, etc.)
[14:20:30] side note: I pressed some keys while the cookbook was waiting at "Enter go when the patch is merged", and it aborted
[14:20:46] it's not idempotent so I have to manually complete the remaining steps
[14:23:15] ohh there is --skip-mr, nice
[14:24:47] in general the current system where it's half cookbook and half tofu seems really confusing in my mind
[14:30:30] I had a long discussion about it with dcaro where I was initially against having the cookbook at all, but he convinced me there is value in it. :) I'm not happy with the current workflow though, maybe we should start from T385604
[14:30:30] T385604: Decision Request - How openstack projects relate to tofu-infra - https://phabricator.wikimedia.org/T385604
[14:33:56] I'm pretty busy lately, so I might not be able to push that task for some time (until beta I guess), if you think it needs more priority let me know and I can try to shuffle things
[14:39:11] I can definitely wait a bit longer :)
[14:40:23] the workflow worked mostly fine today, there is no big blocker that needs to be resolved immediately
[14:47:05] hmm... prometheus tools is down?
[14:47:21] prometheus.svc.toolforge.org/tools gives me 503 (same from grafana)
[14:47:22] looking
[14:49:26] dcaro: I just drained and reimaged cloudcephosd1014 and it came up with all the osd lvm partitions gone. Is that expected or did I do something wrong?
I was thinking that I would just use the 'undrain' cookbook after upgrade but now I'm thinking I need to do wmcs.ceph.osd.bootstrap_and_add instead of just repooling like I expected
[14:49:38] (in which case I should have destroyed it before rather than just draining I guess)
[14:50:19] andrewbogott: I think reimage should not destroy existing partitions, but I might be wrong
[14:50:21] andrewbogott: yep, currently the flow is destroy->reimage->bootstrap
[14:50:40] dhinus: yeah, shouldn't, but does
[14:50:49] ok, I will see if I can destroy it now that it's post-reimage...
[14:51:05] we might be able to tweak the scripts that delete the drives to not do it, but never got around to it
[14:51:07] I want to refine this workflow but will wait until we can do it in codfw1dev
[14:51:26] I'm sure that when I reimaged clouddb hosts, the partitions were still there. but ceph hosts have different partman recipes
[14:51:42] yep
[16:10:34] topranks: you were right about the firmware on cloudcephosd1014. Unfortunately the very next server cloudcephosd1015 won't even pxe boot. I've already done firmware upgrades on it, no change.
[16:15:31] Is the fact that I can do `become toolname echo foo` expected behaviour? (Just trying to figure out why it doesn't work in lima-kilo)
[16:16:08] Looks like they are using different scripts
[16:17:56] addshore: yep, lima-kilo has a silly wrapper only
[16:18:21] it's expected behavior, I think we even have some steps that do so in the wiki howtos and such
[16:18:55] okay great, I might look at tweaking the limakilo wrapper then :)
[16:19:16] 👍
[16:21:35] if you are curious, the prod script is installed with the package https://gerrit.wikimedia.org/r/plugins/gitiles/labs/toollabs/+/refs/heads/master
[16:23:52] ...I don't see any of the dhcp requests I expect but I might be looking in the wrong place (tcpdump -vvv 'udp and (src port 67 or src port 68 or src port 69)' | grep cloudceph)
[16:25:04] addshore: hmm, just manually install that package (sudo apt install misctools), and it seems the become from there does kinda work already :), with `sudo become tf-test echo foo`
[16:25:09] maybe we can use that package instead
[16:25:48] (no rush at all, can be done later, changing the current script is also ok)
[16:26:32] andrewbogott: if you don't grep do you see other requests? (the filter looks ok on a first look)
[16:26:47] yeah, I see lots of other traffic w/out the grep
[16:27:03] I guess I need to stop assuming this ever worked and check the bios settings
[17:17:22] * dcaro off
[17:17:24] cya on monday!
[22:31:02] Seeking reviews on https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87
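To close out, a rough sketch of the cloudcephosd refresh flow discussed above (destroy -> reimage -> bootstrap), since the ceph partman recipe wipes the OSD LVM partitions during reimage anyway. Only wmcs.ceph.osd.bootstrap_and_add is named in the log; the other cookbook names, arguments, and OS version are assumptions and should be checked against the wmcs-cookbooks repo before running anything:

# 1. Drain and then destroy the host's OSDs before reimaging
#    (assumed cookbook name and arguments)
cookbook wmcs.ceph.osd.drain_node cloudcephosd1014

# 2. Reimage with the standard reimage cookbook ("bookworm" is an assumed OS)
cookbook sre.hosts.reimage --os bookworm cloudcephosd1014

# 3. Re-create the OSDs and pool the host back in, instead of a plain undrain
#    (cookbook name from the log above; arguments are assumptions)
cookbook wmcs.ceph.osd.bootstrap_and_add cloudcephosd1014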