[03:41:40] (CR) KartikMistry: [C:+2] Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[03:42:22] (Merged) jenkins-bot: Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[08:21:45] Good morning!
[08:47:07] o/
[08:47:20] ml-lab1001's /srv partition is full, mostly due to the hf-cache
[08:49:13] o/ lemme take a look to see what I can delete
[08:51:16] yes the hf cache takes up 244GB
[09:07:52] I managed to clear some using the huggingface-cli but I don't have permissions to delete in that dir
[09:08:14] brought it down to 190GB for now
[09:13:53] it may be something to run periodically (the cleanup) because long term it will surely eat home dir space
[09:15:28] indeed
[11:42:45] * isaranto afk lunch
[12:50:41] I freed an additional 22G by running my dedupe tool
[14:01:43] klausman: we should think about running some cleanup systemd timer periodically, if the cache is going to be used permanently
[16:23:22] In principle I agree. I just haven't implemented it since I had hoped the extra storage we ordered would arrive in time
[16:35:22] thanks Tobias!
[16:36:12] klausman: sure but even with extra storage we may hit the same issue (ignorant about it, just throwing ideas)
[16:38:14] (I assume the cache keeps growing etc..)
[16:42:02] we use only a set of models but from time to time it may get bigger if ppl try out stuff
[16:42:11] (download and test new models just once etc)
[16:42:24] Yeah, I am also not sure yet how to best handle the cache for sharing. Atm it's an env-vars and sticky-bits affair, which is rather hacky
[16:43:01] I have an idea for some kind of persistent cache: we clear the cache once in a while and run a script that downloads the models we want on a regular basis.
[16:43:54] I can check that tomorrow
[16:44:03] going afk for now. Have a nice evening all o/
[16:44:07] \o
[20:35:54] (PS2) Umherirrender: build: Updating mediawiki/mediawiki-phan-config to 0.15.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1104146 (owner: Libraryupgrader)
[21:02:09] (PS1) Sbisson: Randomize collection-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1105440 (https://phabricator.wikimedia.org/T381888)
[21:03:37] (CR) CI reject: [V:-1] Randomize collection-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1105440 (https://phabricator.wikimedia.org/T381888) (owner: Sbisson)
[21:24:40] (PS2) Sbisson: Randomize collection-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1105440 (https://phabricator.wikimedia.org/T381888)
[21:37:44] FIRING: LiftWingServiceErrorRate: ...
[21:37:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-articlequality&var-backend=nlwiki-articlequality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[21:42:44] RESOLVED: LiftWingServiceErrorRate: ...
[21:42:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-articlequality&var-backend=nlwiki-articlequality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
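
Editor's sketch of the cleanup idea raised at 14:01 and 16:43 in the log (prune the shared hf-cache periodically, but keep the models the team uses regularly). It is not what the team implemented, just one possible shape under stated assumptions: the cache root holds one folder per model named `models--<org>--<name>`, "stale" means nothing under the folder was modified within `max_age_days`, and the keep-list entries are made up. The function names (`repo_id_from_dir`, `newest_mtime`, `select_stale`, `prune`) are hypothetical, and the sketch uses only the Python standard library rather than huggingface-cli.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional, Set


def repo_id_from_dir(name: str) -> Optional[str]:
    """Map a cache folder name like 'models--org--name' to 'org/name'.

    Assumption: folder names follow the 'models--<org>--<name>' pattern.
    Anything else (datasets, lock dirs, stray files) returns None.
    """
    parts = name.split("--")
    if len(parts) < 2 or parts[0] != "models":
        return None
    return "/".join(parts[1:])


def newest_mtime(path: Path) -> float:
    """Most recent modification time of a folder or anything beneath it."""
    times = [path.lstat().st_mtime]
    # lstat() so that dangling symlinks do not raise.
    times.extend(p.lstat().st_mtime for p in path.rglob("*"))
    return max(times)


def select_stale(cache_root: Path, keep: Set[str], max_age_days: float,
                 now: Optional[float] = None) -> List[Path]:
    """Return model folders that are not on the keep-list and look unused."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in sorted(cache_root.iterdir()):
        if not entry.is_dir():
            continue
        repo_id = repo_id_from_dir(entry.name)
        if repo_id is None or repo_id in keep:
            continue  # unknown layout or explicitly kept: never touch
        if newest_mtime(entry) < cutoff:
            stale.append(entry)
    return stale


def prune(cache_root: Path, keep: Set[str], max_age_days: float,
          dry_run: bool = True) -> List[Path]:
    """Delete stale model folders; with dry_run=True only report them."""
    stale = select_stale(cache_root, keep, max_age_days)
    for path in stale:
        if not dry_run:
            shutil.rmtree(path)
    return stale
```

A systemd timer (or cron job) could call `prune(Path("/srv/hf-cache"), {"org/model-a"}, 30, dry_run=False)`; that path and model name are placeholders, not values from the log. Defaulting to `dry_run=True` is deliberate, since the cache is shared and the permission setup was described as hacky.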