[13:27:17] BTW, why won't systemd::sysuser {'liberica': } create a User['liberica'] resource?
[13:27:34] what's the use case for that kind of sysuser?
[13:51:08] does anyone have experience using I/O cgroups? I've been using https://facebookmicrosites.github.io/cgroup2/docs/io-controller.html but it's a lot of trial and error. Context is T376426 (intensive stat jobs locking up HDDs)
[13:51:09] T376426: Improve developer experience on stat hosts part 2 - https://phabricator.wikimedia.org/T376426
[13:52:46] inflatador: are you sure disk I/O is the problem and not just a symptom?
[13:58:36] for machines going hard unresponsive like on the stat hosts I'd suspect memory pressure before disk
[13:59:20] cdanis I think it's the best explanation based on the node exporter data (https://grafana.wikimedia.org/goto/2_nmxtmNR?orgId=1). We already have memory and CPU throttling so I'm pretty sure it's I/O. Open to suggestions though
[14:00:29] more context here as well https://wikimedia.slack.com/archives/CSV483812/p1728657594135219?thread_ts=1727986569.023549&cid=CSV483812
[14:10:39] inflatador: 90% system cpu usage, with 0% iowait and 0% user, is almost certainly the machine thrashing on vmem
[14:10:51] I think whatever cgroup memory limits were added might actually have made the problem more prone to happening
[14:10:52] <_joe_> inflatador: https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-node=stat1011:9100&var-disk_device=All&var-net_dev=All&from=1728650253386&to=1728661447692&viewPanel=7 seems to suggest that it was all "system" cpu time but not "iowait". Which would point to vmem or context switching
[14:11:19] you can also see in the RAM chart that the machine actually has free memory but is refusing to use it
[14:11:38] I bet all of the disk reads during the interval are cold pages of executables being evicted by kswapd
[14:11:49] (and then re-read once they're needed)
[14:12:59] <_joe_> here's another interval where the first peak might be due to iowait, but the second almost surely has something to do with memory allocation
[14:13:03] <_joe_> https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-node=stat1011:9100&var-disk_device=All&var-net_dev=All&from=1729006710499&to=1729036687542
[14:13:29] <_joe_> (look at the used ram profile)
[14:13:36] do we change any of the oom killer tunables?
[14:13:55] this isn't the first time recently where it would have been better if the OOM killer had kicked in (see also the puppetserver issues)
[14:16:41] hmm, we do set vm.swappiness=>0
[14:18:06] we haven't messed with oom settings yet. That same Facebook page mentions oomd, I was thinking of trying that as well...seems similar to momd which I used in a former life
[14:18:55] > In practice at Facebook, we've regularly seen 30 minute host lockups go away entirely.
[14:18:57] heh
[14:51:48] Thanks for the help. So what would explain why the host is refusing to allocate more memory? I noticed the lockup is at about 1/2 the total memory, does that seem significant? Wondering if NUMA might be a factor
[14:52:30] how are memory cgroups set up right now?
[14:52:33] <_joe_> inflatador: do you already have cgroups for payloads, I guess?
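For context, a minimal sketch of the kind of per-user systemd slice limits being discussed here, assuming a user-.slice drop-in; the path and numbers are illustrative only, not the actual config (which is linked a few messages below):

    # /etc/systemd/system/user-.slice.d/limits.conf -- illustrative values only
    [Slice]
    MemoryMax=64G                      # hard cap: allocations past this trigger the OOM killer in-slice
    MemoryHigh=56G                     # reclaim/throttle pressure starts here, before the hard cap
    IOReadBandwidthMax=/dev/sda 50M    # cgroup2 io-controller limits, per the Facebook docs above
    IOWriteBandwidthMax=/dev/sda 50M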
[14:52:37] <_joe_> heh :D
[14:54:10] my guess as to what's happening is that at least one of the cgroups is under intense memory allocation pressure, and the kernel is doing everything it can (aside from invoking the oom killer) to provide backing physical memory
[14:54:17] this is the only lockup that's happened since we enabled memory cgroups. There are a lot that happened before then, so I don't think the cgroups are a factor
[14:54:52] but "everything it can" means "evicting program code that's backed by disk"
[14:55:09] Let me dig up some prior links for y'all
[14:55:46] I'll link the cgroups config too...1 sec
[14:55:47] I totally believe you that the per-user(?) in systemd(?) memory cgroups helped reduce lockups
[14:55:50] np, no rush
[14:57:43] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/statistics/files/individual-user-resources.conf
[14:58:41] we also have 4 GB of zRAM for swap so we can avoid hitting the HDDs
[14:59:56] oh cool
[15:02:45] I think in that case a non-zero value of swappiness makes sense
[15:02:56] `vm.swappiness` sysctl
[15:03:12] yeah, I was thinking about that too
[15:03:35] we set it to 0 on production hosts (but that's maybe just wrong overall)
[15:03:40] Should've done that before I went on vacation and forgot everything ;)
[15:04:08] The stat hosts are super heterogeneous...until last wk 2 of 'em didn't even have swap
[15:04:52] with a little bit of code, you could do some per-host tweaks in hiera
[15:06:54] another thing I was thinking about was reducing MemoryMax in individual-user-resources.conf
[15:07:54] but I'd probably try the swappiness thing first, it will at least mean you don't hit a system CPU wall when you're close to but not quite at the point of invoking the OOM killer
[15:08:44] I'm gonna turn off IRC for focus time for a while though, ttyl :)
[15:08:50] <3
[15:08:59] thanks again, I'll leave you be ;)
[17:18:50] re: the email thread about moving machines out of the legacy codfw network. How do you know from the IP alone if it's in the legacy VLAN or not? for example I see 3 gitlab-runners are listed and looking at their IPs they are in 10.192.16.0/22, 10.192.32.0/22 and 10.192.48.0/22 respectively. so those are all legacy, right? It seems I can go by the rule "if it's called private-[a-z]-codfw then it's legacy" (as opposed to private-{a-z}{0-9} with an additional number).
[17:19:11] Or simpler, "if VLAN ID is lower than 2022 then legacy"
[17:20:41] mutante: correct, row-based VLAN names like private1-b-codfw are old ones while rack-based ones like private1-a3-codfw are new ones
[17:21:33] volans: ACK, thank you. I will try the cookbook with --move-vlan on one of the runners
[17:21:59] you can check the vlan an IP/prefix belongs to in netbox by just searching for the prefix, like https://netbox.wikimedia.org/search/?q=10.192.16.0%2F22 and then looking at the specific object, like https://netbox.wikimedia.org/ipam/prefixes/137/
[17:22:21] same with one click more if you start the search from the IP :)
[17:23:15] yea, that works. Just wanted to verify which are the legacy prefixes to make sure.
[17:23:54] checks "VLANS with ID lower than 2020"
[17:29:14] mutante: also, quick rule of thumb: if the IP has a /22 netmask it is on a legacy vlan, /24 means it's on a new one
[17:30:26] topranks: ah! makes sense, ack
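A quick sketch of how to apply that /22-vs-/24 rule of thumb on a single host (the interface name eno1 is just an example, as in the cumin one-liner that follows):

    # prints "legacy" for a /22 (row-wide vlan) address, "new" otherwise (/24 per-rack vlans)
    ip -o -4 addr show dev eno1 | awk '{print $4}' | grep -q '/22' && echo legacy || echo new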
[17:36:29] this mostly works for me to "find all servers owned by my subteam that are affected": sudo cumin 'A:owner-collaboration-services' 'ip address show dev eno1 | grep -o "\/22"'
[17:36:44] via the owner in puppet and the netmask
[17:50:07] mutante: does this generate the same list?
[17:50:07] cumin 'A:owner-collaboration-services and A:codfw and not P{F:netmask = "255.255.255.0"}'
[17:50:14] without ssh-ing, pure query
[17:52:42] volans: no, that somehow shows more hosts, 18 hosts
[17:53:15] as opposed to just 4 (or 8 for both codfw and eqiad)
[17:53:29] the new vlans are only in codfw right now
[17:53:32] but the first result matches the netbox report, so I think it's good
[17:53:46] yea, codfw first, eqiad later, I assume
[17:54:53] ah yeah mine includes public IPs
[17:54:55] let me amend
[17:55:04] posting here for visibility: my sre.dns.netbox -> sre.puppet.sync-netbox-hiera run picked up a pair of prefix description changes (https://phabricator.wikimedia.org/P70180) that were recently made in netbox. flagging here in case anyone expects diffs but does not see them :) cc: papaul, FYI
[17:57:27] ok this one should do:
[17:57:28] 'A:owner-collaboration-services and A:codfw and P{F:fqdn ~ ".wmnet$"} and not P{F:netmask = "255.255.255.0"}'
[18:00:29] yea, that works. the difference is that virtual machines are added in addition to physical hosts, because the NIC has a different name. ACK, thx
[18:03:19] added to the wiki: https://wikitech.wikimedia.org/w/index.php?title=Vlan_migration&diff=2236184&oldid=2236062
[18:03:55] great
[18:17:00] you could possibly also do just 'A:codfw and P{F:netmask = "255.255.252.0"}'
[18:17:32] but the wmnet$ filter to not include the public vlans also achieves the same
[18:29:28] swfrench-wmf: those are ok yes, pa.paul changed the descriptions earlier on in netbox and they are correct
[18:32:40] topranks: thanks! yeah, I figured these are low-risk, since they're simply descriptions. I mainly wanted to flag in case someone expected to see them in a later sync-netbox-hiera run, but did not :)
[18:33:21] yep thanks for the heads up - always good to know in case something like that happens by mistake
[18:48:13] mutante: I've updated the query as actually this applies to bare metal only, sorry for the confusion, I had forgotten VMs are part of a different migration
[18:51:52] thanks for the quick feedback and useful commands!
[19:11:41] ah, ok! removing the VMs again
[19:12:43] so VMs don't need to be reimaged? or just later?
[19:19:01] The current ganeti setup requires they remain on the row-wide Vlans so we can migrate VMs from one host to another in different racks
[19:19:30] aha, gotcha! ty
[19:19:36] The plan for those longer term is to move them to the "routed ganeti" setup, which uses a completely different network setup and allows for migration without needing the row-wide vlans
[19:19:47] ok
[19:21:27] https://wikitech.wikimedia.org/wiki/Ganeti#Routed_Ganeti
[19:21:48] but for now just ignore the VMs thanks, we're not looking at those just yet
[19:22:49] Arzhel did a better write-up but I can't find it at the moment
[19:29:52] the cookbook makes the assumption that the very first puppet run works and then all is good. this is fine for many servers, but for everything that has a service deployed by scap it usually fails. Typically it needs: puppet run, scap initial deploy, scap deploy, puppet run again, and THEN it's fine. This means though that the reimage cookbook keeps trying for a long time and then ends as failed.
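The sequence described above, as a rough sketch; the scap invocations are placeholders (run from the deployment host's checkout of the service repo) and the exact steps vary per service:

    sudo run-puppet-agent          # 1. first run lays down scap config; the service itself still fails
    scap deploy 'initial deploy'   # 2. seed the code from the deployment host
    scap deploy 'redeploy'         # 3. deploy again once the targets are consistent
    sudo run-puppet-agent          # 4. final run converges cleanly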
[19:29:58] so yea, more to figure out there, especially when scap itself doesn't get installed
[19:30:10] https://phabricator.wikimedia.org/phame/post/view/312/ganeti_on_modern_network_design/