[08:17:00] morning
[08:47:39] morning
[08:50:39] just saw T406191
[08:50:40] T406191: PAWS server not starting - https://phabricator.wikimedia.org/T406191
[08:50:47] it seems that tools static is also down
[08:50:51] I'll check static first
[08:51:14] oh, there was an nfs issue I think
[08:52:39] I'll reboot the stuck workers
[08:55:30] this is working ok
[08:55:32] https://www.irccloud.com/pastebin/BNUiFTlV/
[08:55:54] there you go, alert resolved
[08:55:58] looking into paws
[09:23:30] I'm seeing a correlation between OOM events and the nfs issues starting on the paws workers
[09:23:30] https://phabricator.wikimedia.org/T406191#11236126
[09:24:57] I think that the oom-killer killing a process that is using NFS might somehow leave the nfs volume stuck and not release it
[09:25:04] I can try to reproduce that now though :)
[09:57:20] I found nothing else so far, tried creating a process that would write to a file on nfs (journalctl -f | tee -a outfile) and kill -9 it, but that did not get the nfs mount stuck, I might also need to exhaust the memory on the system
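
A minimal sketch of that reproduction attempt, with the missing memory-exhaustion step added. The /mnt/nfs mount point and the use of stress-ng to trigger the oom-killer are assumptions for illustration; the log only mentions the journalctl/tee writer and a kill -9:

#!/bin/bash
# Sketch of the repro described above: a writer holding a file open on an
# NFS mount is kill -9'd while the system is under memory pressure, then
# the mount is probed to see whether it got stuck.
set -x

# writer that keeps appending to a file on the NFS mount
# (note: $! is the PID of tee, the last process in the pipeline)
journalctl -f | tee -a /mnt/nfs/outfile &
writer_pid=$!

# assumption: use stress-ng to exhaust most of the memory so the
# oom-killer gets involved, as the message above suggests
stress-ng --vm 4 --vm-bytes 90% --timeout 60s &

sleep 30
kill -9 "$writer_pid"

# probe the mount; a stuck NFS mount will typically hang here
timeout 10 ls /mnt/nfs || echo "nfs mount appears stuck"
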
[09:57:38] maybe setting a stricter limit on the pods/oom side might help
[09:57:49] I'll close that task and move to the next
[10:23:05] anyone have opinions on patching all k8s objects to have the mount label, vs. keeping "no label" interpreted as "mount all"? https://phabricator.wikimedia.org/T405828#11236391
[10:57:16] I'll quickly patch all the objects for now, we can decide later
[13:06:24] andrewbogott: it seems like we have a bunch of existing projects where the "allow all internal traffic" security group rule exists for v4 only, we should probably find a way to backfill those for v6
[13:06:37] (just wasted half an hour discovering that in toolsbeta :/)
[13:51:03] tofu patches for moving traffic to the new set of haproxies: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests
[13:52:19] taavi: I think I have some code someplace for backfilling security groups, will look
[13:56:43] oh, 'wmcs-securitygroup-backfill' :) Let's see if that still works...
[13:57:23] one thing to be careful about is that we don't want to backfill it to those projects that removed the default v4 rule
[13:59:27] you're talking about the rule that's bound to the default group, right? "Ingress IPv4 Any Any - default"?
[14:00:41] yes. for new projects it gets provisioned for both v4 and v6, but some older projects only have a v4 version
[14:00:59] including 'testlabs', it seems!
[14:01:00] also, it feels very silly that openstack doesn't support having a single dual-stack rule
[14:01:12] yeah
[15:54:40] taavi: want to double-check my reasoning here?
[15:54:44] https://www.irccloud.com/pastebin/cjYaGooD/
[15:58:46] that produces this list:
[15:58:49] https://www.irccloud.com/pastebin/ZBJK6FMI/
[16:03:09] andrewbogott: seems about right (except you could easily merge those two loops into one), but I can't give a full review right now
[16:03:24] 'k
[16:03:51] I don't care about code efficiency when I'm only going to run the script once and then throw it away :)
[16:53:17] * dhinus off
[17:04:35] * dcaro off
[17:16:07] can I get a +1 for https://phabricator.wikimedia.org/T406240? It's a big ask but we have space
[17:44:24] andrewbogott: i'm doing some light perf testing on the gitlab-runners-staging magnum cluster and i noticed the write (rand and seq) performance of the cinder PVs is quite a bit lower than what we see in digitalocean. wondering what our options are there. would a 4xiops flavor for the node make any difference with cinder volumes?
[17:54:49] another option i'm thinking about is to change the buildkitd deployment (my main concern for heavy disk i/o bottlenecks) to use a daemonset and rely on the node/instance disk
[18:07:48] dduvall: there should be no real performance difference between instance disk and cinder. Either can be configured to have faster performance if needed.
[18:08:56] we don't have the network bandwidth to make every volume run at top speed all the time, but you can make a quota request to have faster volume types enabled for your project.
[19:50:53] andrewbogott: +1'd
[20:27:22] andrewbogott: makes sense, thank you. i'll see about putting in a quota request. where are the available volume types enumerated?
[20:28:52] dduvall: the only specs that currently exist are:
[20:28:55] https://www.irccloud.com/pastebin/5I5ZA6sU/
[20:29:03] but we can create new ones easily
[20:29:39] cool cool
[20:32:02] i wonder how i get the cinder csi driver to use that type, but one step at a time i suppose. i'll file a task for the quota
[21:21:12] oh neat, it seems that `type` is a supported storage class parameter: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/cinder-csi-plugin/using-cinder-csi-plugin.md#supported-parameters
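
Following up on that last link, a sketch of how the cinder CSI driver's documented `type` parameter could pin PVs to a faster Cinder volume type. The class name fast-rw and the volume type high-iops are made-up placeholders; the real type would come out of the quota request discussed above:

# hedged sketch: StorageClass pinning cinder-csi volumes to a specific
# Cinder volume type via the documented `type` parameter
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-rw              # placeholder name
provisioner: cinder.csi.openstack.org
parameters:
  type: high-iops            # placeholder Cinder volume type
EOF

A PVC requesting storageClassName: fast-rw would then be provisioned as a volume of that type.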
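
Circling back to the morning's IPv6 discussion (13:06-14:01): a hedged sketch of what a backfill pass could look like with the plain openstack CLI. This is not the wmcs-securitygroup-backfill script mentioned at 13:56 (its contents aren't in this log), and it encodes the 13:57 caveat by skipping projects whose default v4 rule was removed:

#!/bin/bash
# hedged sketch, not the actual wmcs-securitygroup-backfill script: for each
# project, add an IPv6 twin of the default group's "allow all internal
# traffic" rule, but only where the v4 rule still exists and no v6 rule does
for project in $(openstack project list -f value -c ID); do
    # every project has its own security group named "default"
    sg=$(openstack security group list --project "$project" \
         -f value -c ID -c Name | awk '$2 == "default" {print $1}')
    [ -n "$sg" ] || continue
    types=$(openstack security group rule list "$sg" -f value -c Ethertype)
    # skip projects that deliberately removed the default v4 rule;
    # a real run would also verify it is the remote-group "allow all" rule
    if grep -qx IPv4 <<<"$types" && ! grep -qx IPv6 <<<"$types"; then
        openstack security group rule create "$sg" \
            --ingress --ethertype IPv6 --remote-group "$sg"
    fi
done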