[07:59:22] what's up with the clouddumps1002 io alerts?
[08:00:17] morning
[08:01:19] o/
[08:05:00] there's been a spike in write operations on 1002 for the last ~5.5h
[08:16:52] oh, not write no, read xd
[08:17:01] (too many colors in the graph, they start repeating)
[08:17:04] anyhow, there's a
[08:17:11] few universities doing an rsync from it
[08:17:56] I think it's ok for now, if it stays too high (say until tomorrow), or it starts failing, we can try to slow things down
[08:18:23] that alert has never been great (most of the time we don't really act on it)
[08:53:11] quick review -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142520 updating toolsbeta prometheus certs
[08:54:27] lgtm
[08:54:37] (assuming you have the private key handy somewhere)
[08:55:02] yep, uploaded to the puppetserver already
[11:25:36] review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142572?
[11:26:27] +1d
[11:27:09] thanks!
[11:35:51] no, that wasn't it :(
[12:02:56] super weirdly the same issue happens when using :close() instead of :set_keepalive()
[12:11:34] looks like an interesting rabbit hole :)
[12:12:33] yeah
[12:12:43] i'm starting to be convinced that this is an nginx bug
[12:54:57] andrewbogott: at which point in the debian release process can we spin up a trixie prerelease image in codfw1dev? would be great to check whether the nginx version in trixie is doing the same thing
[12:58:04] given that there's a few of us on pto, I'm thinking of moving the toolforge service check-in to next week (that way it also avoids overlapping with the monthly meeting), does anyone prefer to do it today?
[12:58:47] moving it sgtm
[12:59:14] taavi: looks like there are already dailies at https://cloud.debian.org/images/cloud/trixie/daily/ so I'll see if I can build one now. If that fails you can always just use a raw upstream image.
[13:25:13] taavi: cloud-init seems to not work properly in those dailies but it works well enough to inject a key. debian-13-raw-1 in 'testlabs' and the key is added for user 'debian'
[13:31:28] next fix attempt that seems more promising than the first one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142598/
[13:32:27] so it's a lua version difference
[13:33:15] * andrewbogott lacks context but that change looks harmless at worst
[13:34:32] i have that applied as a local hack on proxy-5 and so far i haven't seen that error happening
[13:34:48] while previously it was pretty reliably reproducible by doing a hard refresh on quarry.wmcloud.org when on a v6 connection
[13:35:31] xd, it was already merged, +1d anyhow
[13:35:32] so i'll merge and try re-enabling v6 for tools-static to generate more traffic
[13:35:53] does it only happen on v6?
[13:36:23] only v6 traffic is being pushed to the new bookworm proxies for now
[13:36:34] ah okok
[13:36:41] since i need to backfill security group things before flipping the switch for v4
[13:43:05] so far the fix is looking very good
[14:22:16] andrewbogott: to answer your question, I'm not 100% sure what the normal allocation workflow for those cloud-private IPs is
[14:22:23] certainly they get added to Netbox:
[14:22:24] https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:22:42] but I'm not sure exactly what configures them on the host side, it may be in puppet/hiera
[14:22:49] I suspect taavi might be able to advise us here
[14:23:07] is the networking (in terms of vlans on ports) correct on the switches for those hosts you mentioned?
[14:25:31] I don't know. I was assuming so since they used to be cloudcontrols but let's look...
[14:26:44] which hosts is this about?
[14:27:11] cloudrabbit200[123]-dev, recently renamed from cloudcontrol200[789]-dev
[14:27:25] they do not have the rabbit puppet role applied to them yet
[14:28:19] ok, so the old names already had cloud-private addresses allocated https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:29:00] that's good, so let's see if I can edit those records...
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:04] oops
[14:30:21] doubly correct :)
[14:31:03] ...is that 'sre.dns.netbox'?
[14:31:07] the netbox puppetdb import script will attach them to the hosts properly, but that is just for information and as you can see hasn't happened with all of them
[14:31:17] andrewbogott: yes that's the one
[14:31:37] with no args, right?
[14:31:46] * andrewbogott is not looking forward to breaking DNS foundation-wide
[14:31:51] yep no args
[14:31:59] ok, running
[14:32:04] it'll prompt you with the diff anyway, which should make sense (i.e. changing the names you modified)
[14:33:10] also I checked the switch ports for those hosts and they are correct
[14:36:00] it seems to be doing the thing
[14:37:48] ftr there was another pending change elasticsearch->cirrussearch
[14:42:06] probably you can ask Brian King about that, but I know they are moving and renaming those hosts so probably fine
[14:42:21] I generally say "yes" if it looks like regular work happening in parallel, with say a given host
[14:42:38] the thing to watch out for is that it's not deleting "en.wikipedia.org" or something :)
[14:53:51] yeah, I know they're renaming things so it seemed unsurprising
[14:56:03] dcaro: one thing I forgot to mention in the checkin: I cleared out cloudcephosd2004-dev from the pool and re-added it and now it seems to work fine. It is much bigger than the other osds in codfw1dev so balancing may be a bit silly until we get some more large ones there. T392366
[14:56:04] T392366: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366
[14:56:25] andrewbogott: thanks!
[14:56:38] chuckonwu: we don't yet have security groups on tofu-provisioning, right?
[14:56:50] I also drained and decom'd cloudcephosd100[123] over the weekend
[14:57:44] dcaro, we don't
[14:57:45] I saw that :), I was 'snooping' while in the hackathon, thanks too, it took a while
[14:58:12] chuckonwu: okok, I think we are missing the bastion security group for toolsbeta, I'll add it (that's why taavi was unable to ssh directly, only through a bastion)
[14:59:07] 👍
[15:01:51] works for me now sshing without proxy :)
[15:02:00] *jumphost
[15:02:17] (and of course, the last 4 runs of the functional tests now pass...)
[15:37:53] that keystone alert is just me restarting things, it will clear on the next check
[16:29:14] andrewbogott: did you change anything ldap-wise? I can't sudo on the toolforge bastions (12 or 13)
[16:29:24] (it works on others, like tools-k8s-control-0)
[16:29:32] *tools-k8s-control-9
[16:29:49] Nothing I changed should affect ssh
[16:29:52] oh, now it works, maybe ldap weirdness
[16:30:11] that's unsettling
[16:30:44] https://www.irccloud.com/pastebin/obGNRDu9/
[16:30:54] hmm... now it's hanging
[16:32:11] now it works :/, something is going on
[16:32:33] Child [1796441] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
[16:32:34] yep, ldap
[16:38:56] hmpf... it's been unstable for a while today, and also yesterday morning
[16:39:00] (utc time)
[16:39:35] quick review (increasing the cli timeout, otherwise the functional tests fail) https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/101
[16:45:16] dcaro: are you seeing those issues everywhere or just on the tools bastion?
[16:45:58] let me see, I only checked the logs on the bastion
[16:46:38] there are some in other VMs (ex. tools-k8s-control-9), though way less
[16:47:02] it's actually a bit different there, out of memory: (2025-05-06 2:44:35): [sssd] [netlink_fd_handler] (0x0040): Error while reading from netlink fd: Out of memory
[16:47:34] there's a different one too regarding ldap `(2025-05-06 11:50:09): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]`
[16:47:46] I have not had sudo fail there though
[16:48:30] it seems better now though
[16:50:50] the last error was ~20min ago
[16:52:56] Could it just be memory issues on the VMs, and sssd is just a symptom?
[16:59:43] for the control node it looks that way (though not very impactful there), for the bastion I'm not sure, let me do a couple of checks
[17:04:24] hmm.. I don't see anything pointing that way (though it could be), the bastion is kinda idle mem-wise, and there are no logs of memory issues in dmesg/journal (or at least I did not see any on a quick look)
[17:05:21] ok, so maybe a real ldap issue. I wonder if gerrit relies on/puts a load on ldap? Seems like it was having trouble ~70 minutes ago
[17:05:37] It has its own user DB but probably does an ldap lookup for an unknown user
[17:05:46] interesting
[17:06:40] No evidence for that theory other than timing
[17:07:09] if ldap failed, it would also be seen there I guess, especially if it uses it "often" enough
[17:07:21] yeah, could be cause or effect or neither.
[17:07:24] I need to get lunch and run some errands... things are stable enough for me to vanish?
[17:07:35] yes yes, it's working now
[17:08:51] ok
[17:13:52] * dcaro off
[17:14:17] feel free to ping me on telegram/whatever if I can help, cya tomorrow! \o
[17:36:29] Is it fine to put a link to the GitLab repository on related Wikitech pages?
[17:48:10] yes!
[18:04:58] T393496 is a quota bump request for the zuul3 Cloud VPS project. The initial cpu, ram, and volume quotas are mostly consumed at this point and folks are hoping to ramp up activity as our vendor contract with the author of zuul itself gets finalized.
[18:04:59] T393496: Increase zuul3 quotas for cpu/ram/disk/instances - https://phabricator.wikimedia.org/T393496
[18:42:50] ^ done
[19:48:30] thanks andrewbogott