[00:10:15] RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:43:55] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:05:51] Looking in journalctl, it appears the hdfs argument list is too long once again
[01:05:51] Oct 03 00:36:02 an-launcher1002 kerberos-run-command[12437]: OSError: [Errno 7] Argument list too long: 'hdfs'
[01:09:39] This will be making some noise until we fix it, and some data will not be deleted as timely as we'd like, but once it is fixed, the next run will catch up the delete to where it should be
[01:11:18] The above icinga alarm relates to https://phabricator.wikimedia.org/T263495
[04:53:35] Analytics, Analytics-Kanban, Patch-For-Review: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (Nuria) mmm, no scratch that , it is not fixed but will fail undeterministically
[07:22:37] Analytics, Analytics-Kanban, Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (The_Discoverer) Resolved→Open The statistics are still not available.
[08:41:23] razzi: very good explanation about the failed timer! If you want to add a cherry on top of it in these cases: go to https://icinga.wikimedia.org/alerts (shows all the outstanding alerts), select the alert that we know it may keep staying in critical state and select "Acknowledge" from the drop down menu. You can also add the reason, in this case I've put the task link.
[08:41:43] In this way all the other sres that look at the page will not spend time on checking that alarm
[08:42:06] (the alarm shows up only in this IRC chan from puppet settings, but it shows up in icinga anyway)
[08:53:48] ah snap denormalize got killed
[08:55:48] seems to be the same issue again
[08:55:49] https://yarn.wikimedia.org/cluster/app/application_1600953045299_60201
[10:24:29] elukey: Will try to run the job manually with more resources
[10:35:48] !log Manually run mediawiki-history-denormalize after fail-rerun problem (second time)
[10:35:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:47:43] Analytics, Analytics-Kanban, Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (mforns) @The_Discoverer This was my fault, the fix by @JAllemandou was finished and merged, but I forgot to deploy it last week. We're working on this, it might take...
[10:49:48] Analytics, Analytics-Kanban, Patch-For-Review: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (mforns) Yesterday I tested the el_drop_unsanitized deletion job with the newest code (order fix + partial match fix) and it worked well. I think...
[11:00:24] (CR) Mforns: Fix directory expansion bug in refinery-drop-older-than (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: Mforns)
[11:50:01] joal: <3
[11:50:33] I'm monitoring elukey - users also run big jobs at the same time, putting pressure on the cluster :(
[12:01:24] :(
[12:02:11] (lunch, will check laterz)
[13:21:34] Analytics, Analytics-Kanban, Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (The_Discoverer) No problem, thanks for your work.
[14:55:19] Job has run up to the critical stage - That stage is like a jumbo-stage (5 different sources joined and unioned together)
[14:56:05] I think that materializing and checkpointing intermediate joined datasets would be a good solution to our problem
[14:56:27] Going AFK for now, will provide a CR on Monday
[21:49:04] Analytics, Analytics-Wikistats: Wikistats New Feature - regioanl and group evaluations - https://phabricator.wikimedia.org/T264512 (Kipala)
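
Note on the `OSError: [Errno 7] Argument list too long: 'hdfs'` failure logged at 01:05:51: that error is raised when the combined size of a command's arguments exceeds the kernel limit. A generic way to stay under the limit is to split the path list into several smaller invocations. The sketch below is only an illustration of that idea, not the fix tracked in T263495 (which the log describes as an order fix, a partial match fix and a directory expansion fix). The helper names `batch_paths` and `build_commands`, the 100,000-byte budget and the example paths are all assumptions; the commands are built but never executed.

```python
from typing import Iterable, Iterator, List


def batch_paths(paths: Iterable[str], max_bytes: int = 100_000) -> Iterator[List[str]]:
    """Yield consecutive groups of paths whose encoded size stays within max_bytes.

    max_bytes is an assumed, conservative budget; the real limit depends on
    the host (see `getconf ARG_MAX`) and on the size of the environment.
    """
    batch: List[str] = []
    size = 0
    for path in paths:
        # +1 accounts for the NUL terminator each argument carries in argv.
        cost = len(path.encode("utf-8")) + 1
        if batch and size + cost > max_bytes:
            yield batch
            batch, size = [], 0
        # A single path larger than the budget still gets its own batch.
        batch.append(path)
        size += cost
    if batch:
        yield batch


def build_commands(paths: Iterable[str], max_bytes: int = 100_000) -> List[List[str]]:
    """Build one `hdfs dfs -rm -r` argument vector per batch (not executed here)."""
    return [["hdfs", "dfs", "-rm", "-r", *batch] for batch in batch_paths(paths, max_bytes)]


if __name__ == "__main__":
    # Hypothetical partition paths, used only to show the batching.
    example = [f"/tmp/example/hour={i}" for i in range(10)]
    for cmd in build_commands(example, max_bytes=80):
        print(" ".join(cmd))
```

Each resulting argument vector could then be passed to `subprocess.run(cmd, check=True)`; keeping the batches in input order means a failed run can resume from the first batch that was not deleted.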