[08:29:58] pc1017 had a replication issue I just fixed
[08:30:07] There was no user impact as it is a spare host for now
[08:30:38] ack, tnx
[08:31:51] db2230 is also alerting, but that's a test host; I will get it fixed too
[10:29:34] arturo: I just noticed the wmcs paging, that's due to https://gerrit.wikimedia.org/r/c/operations/alerts/+/1101019 FYI cc dhinus
[10:34:44] !incidents
[10:34:44] 5526 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[10:34:45] 5525 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[10:34:45] 5524 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[10:34:45] 5523 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[10:34:45] 5522 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[10:34:45] 5521 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[10:37:30] cc dcaro ^
[10:38:10] thanks
[10:38:57] godog: thanks
[10:40:03] godog: do you need me to do anything? (I'm in an offsite, but can step aside if you need anything)
[10:42:04] dcaro: I don't know tbh, the alerts were not firing at all before https://gerrit.wikimedia.org/r/c/operations/alerts/+/1101019 so I'd say it's safe to silence, at least immediately
[10:45:39] things are up yep, we are looking, maybe those alerts were not being evaluated before?
[10:46:03] that's right yes, my bad, I said firing and should have said evaluated
[10:48:14] ack, yep, they are wrong xd, a <= should be a <
[10:48:24] will send a patch soon
[10:54:30] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1101477
[17:54:02] general heads-up if there are reported issues with video transcoding: one of the two video transcoding jobs has been moved to mercurius in the mw-videoscaler namespace. In order to restore the old pattern, revert Ia5264f7101fcd54dbe19d797da66372e4db039e7 in deployment-charts and do an apply in changeprop-jobqueue
[17:54:14] Also, in admin codfw, do `kubectl -n mw-videoscaler delete jobs.batch mediawiki-main-mercurius-webvideotranscodeprioritized`
[17:56:47] So the root disk on build2001 is a bit full-ish. Is there a usual cleanup procedure? (One note: /usr/lib/debug/lib/modules is 23.9GB. Do we actually need all the kernel module symbols for 5.10.0-{19,22,25,27,30}?)
[18:10:13] it's almost all docker images, which I imagine have been published already. Will it break anything to just do a docker system prune?
[18:10:44] no idea. I am currently waiting for an image to be published, so maybe don't prune _right now_ ;)
[18:33:51] ok, published. That took way longer than I expected
[18:34:08] hnowlan: I can run the docker prune, but I am still unsure it's safe
[18:34:42] the first thing I'd always do for low disk space is "apt-get clean". Sometimes that frees quite a bit of space. But not in this case. So the next thing would usually be to check /var/log*
[18:35:36] also not much to clean there. So I guess it's actually the images
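A minimal sketch of the triage order just described (package cache, /var/log, then Docker), assuming GNU coreutils and a standard Docker install; the paths and the prune step are illustrative, not a record of what was actually run on build2001:

```bash
# Rough disk-space triage, in the order discussed above. All paths and steps
# are illustrative assumptions, not the exact commands used on build2001.

# 1. Package cache: see how much "apt-get clean" would free.
du -sh /var/cache/apt/archives

# 2. Logs: list the ten largest entries under /var/log.
du -xsh /var/log/* 2>/dev/null | sort -rh | head -n 10

# 3. Docker: summarise image/container/volume usage; only prune once nothing
#    is still being built or published from this host.
docker system df
# docker system prune   # destructive: removes stopped containers and unused images
```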
[18:36:26] well, the home dirs. one could email all the users with home dirs over 1G to ask them to clean up their own files
[18:37:28] 3.2G can be freed by removing jbond's home dir. The user no longer exists; home dirs don't get deleted when users get deleted.
[18:39:54] percentage-wise that isn't a lot though and likely won't even push it under the monitoring threshold.. so.. dunno
[18:44:09] Wait, we don't clean up homedirs of people that left?
[18:44:18] nope
[18:44:29] That seems... not ideal.
[18:44:54] sometimes we also sync home dirs from old to new servers and it's possible for them to survive basically forever
[18:45:20] I don't think it's as easy as saying "nuke the whole dir the day they leave" either though
[18:45:44] Well, no, but quarantine-wait-backup-delete would be an option
[18:45:45] imagine if you do this on people hosts and then some tools that others use are gone, as one random example
[18:46:52] yea, maybe some process that finds the oldest files first and emails (but whom?)
[18:48:12] I'd expect there to be a series of steps (playbook) for "user has definitely left"
[18:48:52] But I know that there are people that move between volunteer and employee state, and just nuking after they move out of employee state may break their volunteer-ness
[18:49:13] one offboarding step is to decide whether the user is gone or stays around as a volunteer, fwiw
[18:49:31] though often people say they might volunteer; whether they actually do is another story
[18:49:44] yea, that
[18:51:13] for the people* machines I have a script that sends an email about especially large home dirs.. the rest from there is still manual, asking the users.. but at least it points out the big ones
[18:52:43] back to build2001.. it's about /var
[18:53:46] /var/lib/docker is the culprit here, the rest is minor. So I think that needs a ticket for the service owner
[18:53:52] yeah, I think the strongest candidate for cleanup there is the docker images
[18:53:58] indeed, 128G
[18:57:50] Well, no idea who owns these machines, and wikitech doesn't make it obvious
[19:16:10] klausman: It’s probably I/F
[19:34:36] klausman: the source of truth for who owns stuff is technically in the puppet repo. check "grep -r role_contacts" in hieradata
[19:34:55] that confirms I/F
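Not the actual per-host report script mentioned above, just a minimal sketch of the same idea: list home directories above a size threshold so their owners can be contacted. The threshold, paths, and output format are assumptions.

```bash
#!/bin/bash
# Sketch: report home dirs larger than a threshold so owners can be asked to
# clean up. Hypothetical values; not the script referenced in the log above.
set -u

threshold_kb=$(( 1 * 1024 * 1024 ))   # 1G, in 1K blocks as reported by "du -s"

for dir in /home/*/; do
    usage_kb=$(du -sx "$dir" 2>/dev/null | cut -f1)
    if [ "${usage_kb:-0}" -gt "$threshold_kb" ]; then
        owner=$(stat -c %U "$dir" 2>/dev/null || echo UNKNOWN)
        printf '%8d MB  %-12s %s\n' $(( usage_kb / 1024 )) "$owner" "$dir"
    fi
done
```

The output could feed the kind of per-user email nudge discussed above, and it also surfaces directories owned by users who no longer exist (reported as UNKNOWN), which are natural candidates for the quarantine-wait-backup-delete approach.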