[08:17:00] <dhinus>	 morning
[08:47:39] <dcaro>	 morning
[08:50:39] <dcaro>	 just saw T406191
[08:50:40] <stashbot>	 T406191: PAWS server not starting - https://phabricator.wikimedia.org/T406191
[08:50:47] <dcaro>	 it seems also that tools static is down
[08:50:51] <dcaro>	 I'll check static first
[08:51:14] <dcaro>	 oh, there was an nfs issue I think
[08:52:39] <dcaro>	 I'll reboot stuck workers
[08:55:30] <dcaro>	 this is working ok 
[08:55:32] <dcaro>	 https://www.irccloud.com/pastebin/BNUiFTlV/
[08:55:54] <dcaro>	 there you go, alert resolved
[08:55:58] <dcaro>	 looking into paws
[09:23:30] <dcaro>	 I'm seeing a correlation between OOM happening and nfs issues starting on the paws workers
[09:23:30] <dcaro>	 https://phabricator.wikimedia.org/T406191#11236126
[09:24:57] <dcaro>	 I think that somehow killing a process that is using NFS by the oom-killer might leave the nfs volume stuck and not release it
[09:25:04] <dcaro>	 I can try to reproduce now though :)
[09:57:20] <dcaro>	 I found nothing else so far, tried creating a processes that would write to a file on nfs (journalctl -f | tee -a outfile), and kill-9 it, but it did not get the nfs mount stuck, might need also to exhaust the memory on the system
[09:57:38] <dcaro>	 maybe setting a stricter limit on the pods/oom side might help
[09:57:42] <dcaro>	 I'
[09:57:49] <dcaro>	 I'll close that task and move to thenext
[10:23:05] <dcaro>	 anyone has opinions on patching all k8s object to have the mount label, or keep interpreting no label as mount all? https://phabricator.wikimedia.org/T405828#11236391
[10:57:16] <dcaro>	 I'll quickly patch all the objects for now, we can decide later
[13:06:24] <taavi>	 andrewbogott: it seems like we have a bunch of existing projects where the "allow all internal traffic" security group rule exists for v4 only, we should probably find a way to backfill those for v6
[13:06:37] <taavi>	 (just wasted half an hour on discovering that in toolsbeta :/)
[13:51:03] <taavi>	 tofu patches for moving traffic to the new set of haproxies: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests
[13:52:19] <andrewbogott>	 taavi: I think I have some code someplace for backfilling security groups, will look
[13:56:43] <andrewbogott>	 oh 'wmcs-securitygroup-backfill' :)  Let's see if that still works...
[13:57:23] <taavi>	 one thing to be careful about is that we don't want to backfill it to those projects that removed the default v4 rule
[13:59:27] <andrewbogott>	 you're talking about the rule that's bound to the default group, right?  "Ingress 	IPv4 	Any 	Any 	- 	default " ?
[14:00:41] <taavi>	 yes. for new projects that gets provisioned for both v4 and v6, but some older projects only have a v4 version
[14:00:59] <andrewbogott>	 including 'testlabs' it seems!
[14:01:00] <taavi>	 also, it feels very silly openstack doesn't support having a single dual-stack rule
[14:01:12] <andrewbogott>	 yeah
[15:54:40] <andrewbogott>	 taavi: want to double-check my reasoning here?
[15:54:44] <andrewbogott>	 https://www.irccloud.com/pastebin/cjYaGooD/
[15:58:46] <andrewbogott>	 that produces this list:
[15:58:49] <andrewbogott>	 https://www.irccloud.com/pastebin/ZBJK6FMI/
[16:03:09] <taavi>	 andrewbogott: seems about right (except you could easily merge those two loops into one), but I can't give a full review right now
[16:03:24] <andrewbogott>	 'k
[16:03:51] <andrewbogott>	 I don't care about code efficiency when I'm only going to run the script once and then throw it away :)
[16:53:17] * dhinus off
[17:04:35] * dcaro off
[17:16:07] <andrewbogott>	 can I get a +1 for https://phabricator.wikimedia.org/T406240?  It's a big ask but we have space
[17:44:24] <dduvall>	 andrewbogott: i'm doing some light perf testing on the gitlab-runners-staging magnum cluster and i noticed the write (rand and seq) performance of the cinder PVs is quite a bit lower than what we see in digitalocean. wondering what our options are there. would a 4xiops flavor for the node make any difference with cinder volumes?
[17:54:49] <dduvall>	 another option i'm thinking about is to change the buildkitd deployment (my main concern of heavy disk i/o bottlenecks) to use a daemonset and rely on the node/instance disk
[18:07:48] <andrewbogott>	 dduvall: there should be no real performance between instance disk and cinder. Either can be configured to have faster performance if needed.
[18:08:56] <andrewbogott>	 we don't have the netework bandwidth to make every volume run at top speed all the time but you can make a quota request to have faster volume types enabled for your project.
[19:50:53] <dcaro>	 andrewbogott: +1d
[20:27:22] <dduvall>	 andrewbogott: makes sense, thank you. i'll see about putting in a quota request. where are the available volumes types enumerated?
[20:28:52] <andrewbogott>	 dduvall: the only specs that currently exist are:
[20:28:55] <andrewbogott>	 https://www.irccloud.com/pastebin/5I5ZA6sU/
[20:29:03] <andrewbogott>	 but we can create new ones easily
[20:29:39] <dduvall>	 cool cool
[20:32:02] <dduvall>	 i wonder how i get the cinder csi driver to use that type, but one step at a time i suppose. i'll file a task for the quota
[21:21:12] <dduvall>	 oh neat, it seems that `type` is a supported storage class parameter https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/cinder-csi-plugin/using-cinder-csi-plugin.md#supported-parameters