[00:49:57] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:53:56] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:11:33] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10Nuria) @MMiller_WMF we missed this month deploy of this change, will it be oK to wait for the run of November 1st or you needed it sooner? [06:05:03] klausman: good morning :) I am deeply sorry for the stat100x, I forgot to check at the start the dhcp/pxe config, 1004/6/7 have been re-installed with stretch :( [06:05:38] and nobody really thought to check /etc/debian_version or similar after 1004 :( [06:17:31] /o\ [06:17:43] * joal feels super sorry :S [06:19:32] we need to redo the work again [06:21:05] anyway, trying to re-run drop-el-unsanitized-events.service on launcher [06:21:49] ack - launching a manual sqoop for page_props and user_properties [06:21:56] RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:25:14] this is still running --^ [06:25:19] now the weird things is [06:25:20] Oct 02 00:00:01 an-launcher1002 systemd[1]: Started Drop unsanitized EventLogging data from the event database after retention period.. [06:25:23] Oct 02 00:00:01 an-launcher1002 kerberos-run-command[21654]: User analytics executes as user analytics the command [06:25:28] Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: ........................ [06:25:31] Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: ---------------------------------------------------------------------- [06:25:33] Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: Ran 24 tests in 0.020s [06:25:39] 46 minutes? [06:25:52] (then it appears an error later on) [06:26:24] hm [06:28:04] some patches were merged yesterday for the script https://phabricator.wikimedia.org/T263495 [06:28:11] it is probably related [06:28:26] I assume it is elukey [06:28:28] the script ends up with hdfs ls listing too many files and failing [06:29:32] elukey: IIRC the fix dpeloyed yesterday was to prevent this case preciselk [06:31:17] elukey: wait, what? When I reinstalled them, they went to stretch? [06:31:30] klausman: yes... [06:31:41] Oh man. Well, at least it was good exercise :D [06:31:42] we didn't really check what os was installed [06:32:03] At least this time around, we can probably skip the backups? [06:32:05] I am really sorry, it didn't occur to me to triple check [06:32:19] I missed it as well, don't beat yourself up over it [06:32:20] in theory yes, the reimage seems pretty solid [06:32:56] We still have the 1007 backup, if we do the real deal with that one first, we can be sure the actual buster install is fine without risking too much [06:33:50] Should I send out a warning mail now about 1007 being reinstalled on Monday? 
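A minimal sketch of the kind of post-reimage sanity check discussed above (reading /etc/debian_version so a stretch install does not go unnoticed again). This is not existing WMF tooling; the host list and expected major version are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: check /etc/debian_version on freshly reimaged hosts.

Hypothetical helper (not existing WMF tooling); the host list and the
expected major version below are placeholders for illustration.
"""
import subprocess

EXPECTED_MAJOR = "10"                 # buster; a stretch install would report 9.x
HOSTS = ["stat1004.eqiad.wmnet"]      # placeholder host list

def debian_version(host):
    result = subprocess.run(
        ["ssh", host, "cat", "/etc/debian_version"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()      # e.g. "10.5" or "9.13"

for host in HOSTS:
    version = debian_version(host)
    ok = version.split(".", 1)[0] == EXPECTED_MAJOR
    print(f"{host}: {version} {'OK' if ok else '<-- wrong release, needs reimage'}")
```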
[06:33:51] !log Manually sqoop page_props and user_properties to unlock mediawiki-history-load oozie job [06:33:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:36:08] 10Analytics, 10Analytics-Kanban: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10elukey) There are two alarms in icinga right now: * stat1007 - search-drop-query-clicks.service ` Oct 02 03:30:03 stat1007 kerberos-run-command[49936]: ------------... [06:38:37] klausman: yep, we could schedule the reimage for say tue [06:41:40] I'll send the mail today, schedule the reinstall for Tue morning UTC [06:45:43] 10Analytics, 10Analytics-Kanban: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10elukey) It might be something off between journald and python logging, because I can see for my re-run: ` 2020-10-02T06:18:43 INFO Unit tests passed. 2020-10-02T06... [06:49:46] elukey: The talk about migrating from hdfs-2.7 to hdfs-3.3 was very interesting yesterday [06:49:54] !log add an-worker110[0-2] to the hadoop cluster [06:49:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:49:55] elukey: I took some note [06:50:00] ah nice! [07:05:26] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:05:30] yep [07:09:29] all GPU nodes are in service, we'll need to reboot them to properly configure the GPU, but I'll do it next week :) [07:14:20] in the meantime, 2.65PB available on HDFS [07:14:21] :D [07:14:50] in theory with all 16 nodes in we should cross the 3PB mark [07:14:52] New nodes for the win :) [07:14:59] but then I'll have to remove 16 for OOW [07:15:05] so the joy will not last :D [07:15:28] * joal thinks of the apple talk where they mentioned having a 140PB cluster, and needed a second, and then a third [07:16:12] I am wondering what they are doing in terms of federation etc.. [07:16:32] interesting question [07:17:00] I mean they have multiple 140PB clusters, so I guess they have a storage team dedicated only to hdfs [07:17:14] probably with hadoop committers [07:17:44] elukey: the fact that they mention having muliple clusters makes me feel they don't do federation (they'd have a single one?) - but I might awefully wrong [07:19:41] joal: maybe 140PB is some scalability threshold that they had, but having a single namenode for 140PB would be really challenging [07:20:07] even if they don't cross 300M files (I doubt it), there are the block reports from all datanodes, etc.. [07:22:21] elukey: indeed - this is smothing I have heard - the network aspect is not to be forgotten at those scales [07:25:02] I am reading https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/NodeLabel.html, really interesting [07:25:22] worth to open a task? 
It seems something that requires some review/testing [07:25:27] (for the GPU nodes I mean) [07:25:52] elukey: this is the first approach [07:26:17] elukey: it has problems, namely there is no control over multiple jobs trying to access GPUs [07:26:41] But it allows having GPUs on hadoop jobs, so worth a try [07:27:10] ah yes yes [07:27:22] but otherwise those gpus will sit there taking dust :D [07:27:36] elukey: I support trying :) [07:42:11] after this 6+16-16 run of workers, we'll add 24 more [07:42:23] that will be +1.152PB [07:43:07] good thing that on monday we'll put more RAM on the masters :D [07:54:43] great :) [07:55:11] elukey: let's also plan on the strategy to help reducing small files :) [07:56:09] joal: I thought we wanted ozone! :P [07:56:41] elukey: For sure we want ozone - AND we want bigger files :) [08:43:07] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) a:05RobH→03Cmjohnson @Cmjohnson an-worker1111 seems to be in the wrong rack: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia.... [08:52:04] elukey: one of the thing we should really not forget fron yesterday talk on HDFS: distcp is LONG, and discp doesn't work well with very large folders it's better to split them [08:53:33] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10JAllemandou) >>! In T258047#6511170, @Nuria wrote: > @MMiller_WMF we missed this month deploy of this change, will it be oK to wait for the run of November 1st or you needed it... [08:54:46] joal: :( [08:55:00] elukey: yeah - at least we know [09:22:58] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Put 6 GPU-based Hadoop worker in service - https://phabricator.wikimedia.org/T255138 (10elukey) All nodes joined the cluster, now we only need to reboot them (one by one) to enable the GPUs (some settings need a reboot). After this we'll need to... [09:23:47] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh 16 nodes in the Hadoop Analytics cluster - https://phabricator.wikimedia.org/T255140 (10elukey) [09:28:40] 10Analytics: Configure Yarn to be able to locate nodes with a GPU - https://phabricator.wikimedia.org/T264401 (10elukey) [09:28:44] * elukey afk for a bit [09:42:31] 10Analytics, 10Code-Health-Objective, 10Epic, 10Platform Engineering Roadmap, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Naike) [09:55:36] 10Analytics: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10elukey) [09:57:20] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) [09:59:44] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) an-worker1117 is fixed, it was preferring to PXE boot as opposed to boot from disk, so the loop was endless. [10:00:13] elukey: For T264408, what host would be the best to test a new version on? [10:00:14] T264408: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 [10:01:34] klausman: good question, not sure.. 
we could use a drained hadoop worker, but the os is different from stat100x (even if kernels are similar) [10:02:05] ideal would be a machine that still sees GPU use, but is also not the most important one [10:02:39] It seems the GPU on 5 is barely used at all [10:04:03] recently yes since it is completely stuck :D [10:04:22] my main concern is that with dkms there might be the need of reboots [10:04:42] I was looking at the last 7 days, 0 use. [10:05:08] ack [10:06:19] The other thing that worries me is the ability to go back. I know that with apt, backdating packages is a royal pain [10:08:27] in theory it should be doable, we have separate apt component for each rocm release [10:08:59] so a rollback should be something close to purge/reinstall packages [10:09:23] Good point [10:09:58] So the plan is to add the latest rocm as a new component, add that to (say) 1005 via puppet, purge the old stuff, install the new stuff and see what falls over. [10:10:38] exactly [10:11:05] this is what I have been doing so far https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages [10:24:59] Made a patchset for adding 3.8 to reprepro [10:30:53] klausman: check what we do for 33, there are some other settings to add in the apt-repo [10:31:10] a couple of lines I mean, to allow to check/update packages etc.. [10:32:53] see we talk about GPUs and miriam joins the chan :D [10:34:22] :D miriam is technically off today, but please elukey let me know if there is anything I can do to help!! Thanks for all the work :) [10:35:49] miriam: nothing to do don't worry :) [10:36:42] elukey: The docs you sent mention to add the block I have in the patchset, then run puppet and then add the dep stuff. I was not aware there was something else. [10:39:42] Oh, you mean modules/aptrepo/files/distributions-wikimedia [10:40:57] yep if it is not on the doc please add it [10:41:16] going afk for lunch, ttl! [10:41:19] So, hmm. Would we only add this to Buster for now and decide Stretch later (or skip it)? Or go the whole way now? [10:41:38] it depends where we want to test it :D [10:41:47] 1005, I'd say [10:42:17] yeah but what if dkms causes the host to hang or if you need to reboot a couple of times due to some issue? [10:42:23] it will disrupt people working [10:42:31] this is my main concern [10:42:48] the only viable solution could be to set up a maintenance window, that could work [10:42:49] Well, I don't know of a place where that wouldn't be the case. And 1005 is hung anyway, it'll need a reboot soon either way [10:43:00] but with the usual two days of warning etc.. [10:43:16] Yes, I could announce today and do the deed on Monday or Tuesday [10:43:56] I would do some research first, it might cause stuff like tensorflow compatibility to change etc.. [10:44:21] Hrmm. Good point. [10:44:22] the last time it was tf 1 to tf2 so really impactful for users (and I had to wait), this time should be ok but let's double check first [10:44:35] I'll do some changelog reading [10:44:36] I was proposing the worker since once drained we can really have less constraints [10:44:39] ack [10:44:39] ttl! [11:51:27] morning! [11:51:43] Hi fdans [12:57:42] yay tea delivery! [13:00:19] special kinds?? 
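Going back to the ROCm rollback plan discussed earlier in the morning (purge the installed ROCm packages and let puppet, pointed at the pinned apt component, reinstall them): a rough sketch of the purge half is below. The package-name prefixes are assumptions, and the reinstall side is deliberately left to puppet as in the discussion above.

```python
#!/usr/bin/env python3
"""Sketch of the 'purge then reinstall' ROCm rollback discussed above.

Assumptions: ROCm-related packages can be identified by the name prefixes
below, and a subsequent puppet run (with the desired apt component enabled)
reinstalls the wanted versions.
"""
import subprocess

ROCM_PREFIXES = ("rocm", "hsa", "hip", "miopen", "rocblas")   # assumed prefixes

def installed_packages():
    out = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package}\\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()

def rocm_packages():
    return [p for p in installed_packages() if p.startswith(ROCM_PREFIXES)]

if __name__ == "__main__":
    pkgs = rocm_packages()
    print("Would purge:", " ".join(pkgs) or "(nothing)")
    # Uncomment to actually purge; puppet then reinstalls from the new component.
    # subprocess.run(["apt-get", "purge", "-y", *pkgs], check=True)
```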
[13:00:33] !log add an-worker110[6-9] to the Hadoop cluster [13:00:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:03:35] 1kg of Earl grey and 150g from the only tea plantation in NZ [13:04:01] (https://zealong.com/) [13:05:20] wow [13:06:24] Yeah, not cheap, but we'll see if it's worth it [13:07:17] * joal remembers the visit of tea plantations in Sri Lanka [13:41:48] elukey: joal whenever yall have a minute could you take a look at this puppet patch? [13:41:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/629409 [13:42:08] reading [13:44:26] fdans: I'm sorry my memory on the pageview_complete topic is not accurate - This dataset is to replace pagecount-ez, right? [13:44:42] joal: that's correct :) [13:45:31] Second memory backup fdans please: You have recomputed/reformatted all pagecount-ez to the new format (therefore the backfilling jobs), and this sync is for the whole dataset (including new when generated every hour) [13:46:15] joal: yes, this takes it from its location in hdfs to the dumps host [13:46:21] ah this is also interesting for me, if this is done then the oozie db increase should stop right? [13:46:35] :) [13:47:19] Thanks fdans for the reminder [13:47:40] mforns: o/ - let me know when you are around / have a moment [13:48:19] elukey: not exactly. Right now there's 2.5 years that have not yet been backfilled because I detected an inconsistency with the original dumps, so the backfilling is stopped until I solve that [13:49:00] ahh [13:49:06] so still stopped ok [13:49:14] elukey: this puppet patch has nothing to do with oozie, it is only to set up the rsync to the dumps hosts [13:49:17] can you tell me something when you restart? [13:49:25] elukey: yes for sure [13:49:40] fdans: yep yep I got it, I was only curious about the db increase :) [13:50:09] pcc looks good, we can merge if joal is ok [13:50:12] elukey: has it been increasing over the last couple days? [13:50:25] fdans: The related oozie job for regular data generation is pageview-daily_dump - correct? [13:50:40] yes [13:51:06] ok - I question having the job running hourly then [13:51:09] fdans: --^ [13:51:52] fdans: much slower pace than before - https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&refresh=5m&from=now-14d&to=now [13:52:01] hi elukey just joined [13:52:15] hola hola [13:52:38] joal: oh right [13:53:07] I set it running hourly thinking about the historical job, which is hourly, but this should be ran daily [13:53:49] fdans: even if not that expensive as no data is being copied, hourly check for changes over a few thousand folders while we know no folder hass changed is not needed :) [13:54:46] fdans: I advise running the sync job early human-morning (~5am) - the probability of new data being present would be higher [13:55:24] joal: yes that makes sense, will update CR shortly, thank you for the review [13:56:00] mforns: https://phabricator.wikimedia.org/T263495#6511231 - not urgent, but it may be related to yesterdays' changes [13:56:19] elukey: yes I saw that [13:56:24] it's weird! [13:56:54] I saw your comment about the ordering of the list in the test, but the code should return always the same order I think... [13:57:40] elukey: oh, actually, on a second thought, there's some partial ordering issues that could happen. 
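For context on the "partial ordering issues" just mentioned: the refinery test itself is not quoted in the log, but the usual shape of such a fix is to compare the expected and mocked listings order-independently. A minimal sketch, not the actual refinery-drop-older-than test:

```python
import unittest

class FakeHdfsLsTest(unittest.TestCase):
    """Illustration only: compare path lists order-independently so the test
    does not depend on the (unspecified) ordering of a mocked hdfs ls."""

    def test_listing_is_order_independent(self):
        expected = ["/wmf/data/a", "/wmf/data/b", "/wmf/data/c"]
        returned = ["/wmf/data/c", "/wmf/data/a", "/wmf/data/b"]  # mock output
        # sorted() (or assertCountEqual) removes the ordering assumption
        self.assertEqual(sorted(expected), sorted(returned))
        self.assertCountEqual(expected, returned)

if __name__ == "__main__":
    unittest.main()
```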
[13:57:48] great, will fix that [13:58:46] mforns: ah yes it was only a very ignorant comment, not sure if it made sense or not :D [13:58:57] elukey: it did, actually! [14:01:41] elukey, fdans - Dropping for kids - once the daily timing is fixed, it's good for me to go [14:02:05] fdans: maybe you should ping Brooke, as she is a reviewer but didn't comment? Just saying :) [14:02:15] See you in ~2h fols [14:02:35] (03PS1) 10Mforns: Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495) [14:06:17] mforns: we can live test this on stat1007 --^ [14:06:32] elukey: just tested in an-launcher [14:06:53] sure I mean if it fixes the current timer failed on 1007 [14:07:08] it's a very minor change, the program was failing to pass the tests because an undeterminism in the hdfs mock [14:07:16] oh, ok [14:07:36] let me fetch the command [14:13:05] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10CBogen) > Is there a plan to bring MediaSearch to other wikis in the future, or will it b... [14:13:11] team: I have a bunch of furniture to move so I'm going to be afk for a couple hours, will be back around 4pm UTC [14:20:07] elukey: OK yes the fix works, now search-drop-query-clicks works fine. I think we can merge the patch. [14:20:15] elukey: looking now into the other error. [14:21:29] mforns: lovely [14:21:49] (03CR) 10Elukey: [C: 03+2] Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [14:21:57] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [14:23:59] !log live patch refinery-drop-older-than on stat1007 to unblock timer (patch https://gerrit.wikimedia.org/r/6317800) [14:24:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:52:23] mforns: if the bug is too difficult to narrow down we could think about a quick rollback + refinery deploy without hdfs, to make the script running [14:54:58] elukey: I found what is happening [14:55:39] (03PS1) 10Milimetric: [WIP] Refactor state for cleanliness and consistency [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/631791 (https://phabricator.wikimedia.org/T262725) [14:56:45] with the new code, the program might try to hdfs.ls too big of a tree, IF the path_format regular expression is wrong, OR the regular expression excludes a big enough portion of the tree... [14:57:29] that's the case of drop-el-unsanitized-events, where the regexp excludes all mediawiki_job data sets within the base_path tree [14:58:13] the script tries to recursive ls those subtrees to find matches in the regexp, but it can't, thus leading to ls the whole tree [14:58:55] it's a difficult problem [15:00:29] mforns: i am not sure i understand , the regex there excludes mediawiki tables and what is the problem it causes? 
[15:00:48] mforns: wait i have a 1 on 1 , i can talk later [15:00:54] nuria: ok [15:02:21] (03Abandoned) 10Milimetric: [WIP] Clean up data flow as pertains to state [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/604387 (owner: 10Milimetric) [15:35:37] 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) @Milimetric -- @bd808 was able to fix the code syntax highlighter on the blog's editor, and I applied this to the two main blocks on yo... [15:49:38] 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10Milimetric) @srodlund looks awesome, thanks to you and Bryan :) [15:52:38] 10Analytics, 10Platform Team Sprints Board (Sprint 5), 10Platform Team Workboards (Green): Ingest api-gateway.request events to turnillo - https://phabricator.wikimedia.org/T261002 (10WDoranWMF) [15:53:09] 10Analytics, 10MediaWiki-REST-API, 10Patch-For-Review, 10Platform Team Sprints Board (Sprint 5), and 2 others: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (10WDoranWMF) [15:54:20] 10Analytics, 10Event-Platform, 10EventStreams: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (10Jdlrobson) [16:26:22] (03PS1) 10Mforns: Fix directory expansion bug in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) [16:27:04] elukey, nuria, razzi: ^ [16:27:46] this should fix the thing. I added a (hacky) method to check for partial matches. Added unit tests, and tested with real data. Seems to work. [16:31:39] 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) 05Open→03Resolved a:03srodlund Yay! Announced on Twitter and resolving this ticket! Thanks for all your work on this! It's a real... [16:37:46] fdans: heya - currently writing an email about the mediawiki-history oozie job failed - I'm relaunching it from hue and will investigate as it is the second time happens [16:39:16] fdans: I was monitoring it as it failed last month IIRC [16:39:44] mforns: very nice explanation in the commit msg [16:40:20] hehe elukey you read it to the end, you're my hero [16:40:40] well you are for explaing the problem in that detail :) [16:41:10] one thing that I want to understand: can you tell me a bit more about hdfs.ls() leading to thousand of results? [16:41:30] (I am not familiar with how it behaves in our code, I can RTFM myself in case :D) [16:41:44] elukey: quick quexstion for you - how do I need to restart an oozie job with hue with prod user? 
[16:42:25] nono in theory it should be possible with hue-next, the error that Marcel was getting SHOULD be related to mysql connections exhausted [16:42:39] if you have a moment to check that I am not crazy I'd be grateful [16:42:39] Ah ok - Trying [16:42:55] mforns: I meant "This makes it hdfs.ls() the whole tree, which can contain tens of [16:42:58] thousands of sub-paths" [16:43:28] !log Rerun mediawiki-history-denormalize-wf-2020-09 after failed instance [16:43:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:44:04] I am asking since I'd like to make sure that hdfs.ls() is doing the right thing, like not recursively showing an entire subtree [16:44:20] elukey: no no, hdfs.ls is fine [16:44:52] elukey: but the script was trying to hdfs.ls(lots and lots of directories) [16:45:37] elukey: i.e.: hdfs.ls([path1, path2, ... path80000]) [16:46:28] ahhh all in once [16:46:34] yes [16:46:49] okok [16:47:54] mforns: is there a reason why it does ls with multiple paths? (again me ignorant sorry) [16:48:19] it's ls-ing the directory tree by depth level [16:48:47] elukey: it could do it one by one, but this way is more efficient I believe [16:48:59] ok so it was for performance [16:49:01] okok [16:49:03] yea [16:59:09] mforns: here I am sorry - ok so the fix is basically avoid adding dirs that matches the regex, while the script expands the subtrees [16:59:12] IIUC [16:59:34] so when hdfs.ls() runs it should be more manageable [17:00:11] elukey: yes, it avoids adding all the dirs that do *not* match the regex, so the hdfs.ls() is more manageable [17:00:32] ah yes yes sorry, not [17:00:41] from the dirs that do not match the full regex, it only will add those that match it partially [17:00:54] it seems a good fix for the moment, I am wondering what is the limit and if there is the risk of hitting it again in the future [17:01:06] mforns: I second elukey +1'ing the descriptive commit message. IMO it's a good time to reexamine the use cases of this script and see if there's a way that it can do what it needs to while relying less on complex regex. As I understand, use cases that start with `--path-format='.+/ ...` will always pass the new `path_is_partial_match` check, so that is would only be a partial solution [17:02:35] razzi: that could be solved when adding new deletion jobs, do not use '.+/', instead use '[^/]+' [17:03:18] we could also explore a depth-first approach for dir traversal, and see if python+asyncio/gevent/etc.. could lead to acceptable perfs [17:03:30] (most of the time IIUC the script waits for I/O from hdfs or hive) [17:06:33] mforns: we can go ahead with this fix in my opinion, and maybe brainstorm about a long term solution? What do you think? [17:06:35] elukey: you mean using depth-first instead of the "threshold"-pruning? [17:07:19] mforns: yep I mean adding a boundary on the maximum number of paths to ls() in one go [17:07:25] if even possible [17:07:58] elukey: I don't see performance problems after this fix. 
It is true, as razzi says, that the regexp parameter is complicated, but it was like that since we created this script [17:17:57] 10Analytics, 10Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10herron) p:05Triage→03Medium [17:35:36] mforns yep yep I agree that the fix is good, what I was wondering is if in the future, maybe with more complex subtrees, we could end up in a situation in which even if with the regex the script hits the max arg list [17:35:54] but we can tackle the problem if it presents itself in the future [17:36:19] (I don't want to nitpick just reason out loud with ideas, don't feel blocked) [17:36:48] elukey: if we use regexes carefully, like the example razzi and I were discussing (use '[^/]+/' instead of '.+/'), I think this should not give more issues in the future [17:38:00] however, I agree with razzi, that the regexp argument is not ideal, and we can think of solving this in the future [17:38:20] yep you two can follow up on what's best :) [17:39:49] yea, sounds goo [17:39:53] *good :] [17:41:03] dumb idea mforns, elukey, razzi - Would rewriting the algo in jvm world make our life easier through having better HDFS api? [17:41:28] * joal run and hides into jvm world [17:41:34] heheheh [17:43:30] I think the problem that we have now is the difficult regular expression argument, which is probably unrelated to language. But yea, we can consider moving to Scala? not Java please! [17:43:34] :P [17:44:03] You have my entire support for scala-not-java mforns :) [17:47:34] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, 10Patch-For-Review: MEP Client MediaWiki PHP - https://phabricator.wikimedia.org/T253121 (10Mholloway) Per discussion in the team meeting earlier this week, this can live on the PID workboard only rather than also PI core... [18:01:15] mforns, joal : +1 to mforns (and razzi) that issue is what arguments and the way the tree is traversed not language [18:01:32] works for me :) [18:06:39] * elukey afk! [18:10:38] (03CR) 10Razzi: Fix directory expansion bug in refinery-drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [18:20:28] (03PS2) 10Mforns: Fix directory expansion bug in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) [18:20:46] (03CR) 10Mforns: Fix directory expansion bug in refinery-drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [18:23:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10MMiller_WMF) Oh okay, great! @Miriam -- could you please check to see that the data you need is there? Then we can resolve the task. [21:21:36] mforns: ok, back on this i understand the issue with dropping script now [21:43:44] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Dzahn) The hosts here are showing up in a weird state. When running the DNS cookbook you get warnings that these hosts exist but are not "in devices... 
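To make the preceding discussion concrete, here is a rough sketch of the level-by-level tree expansion with the partial-match pruning described above. It is not the actual refinery-drop-older-than code: hdfs_ls(), could_still_match(), expand_tree() and the prefix-matching heuristic are assumed names and an assumed approach, and the paths in the asserts are illustrative. The asserts at the end also show why '[^/]+' (exactly one path segment) prunes better than '.+' (which can swallow any number of segments and therefore partially matches almost everything, as razzi pointed out).

```python
import re

def hdfs_ls(paths):
    """Stand-in for the real HDFS listing helper: given a batch of directories,
    return their immediate sub-directories. Assumed interface, not the real one."""
    raise NotImplementedError("illustration only")

def could_still_match(directory, path_format):
    """Rough equivalent of the 'partial match' check discussed above: keep a
    directory only if some '/'-prefix of the path_format regex matches it, so
    subtrees that can never match the full pattern are pruned early.
    Heuristic for illustration, not the refinery implementation."""
    segments = path_format.strip("/").split("/")
    for i in range(1, len(segments) + 1):
        prefix_regex = "/".join(segments[:i])
        if re.fullmatch(prefix_regex, directory.strip("/")):
            return True
    return False

def expand_tree(base_path, path_format, depth):
    """Level-by-level expansion: each round lists the surviving directories in
    one batched hdfs_ls() call. Without the could_still_match() pruning, every
    directory under base_path stays in `current`, and the batch can grow to
    tens of thousands of paths (the failure mode seen above)."""
    current = [base_path]
    for _ in range(depth):
        children = hdfs_ls(current)                 # one batched listing per level
        current = [d for d in children if could_still_match(d, path_format)]
    return [d for d in current if re.fullmatch(path_format, d.strip("/"))]

# Why '[^/]+' prunes better than '.+': one path segment vs. "anything at all".
assert re.fullmatch("wmf/data/event/[^/]+", "wmf/data/event/navtiming")
assert not re.fullmatch("wmf/data/event/[^/]+", "wmf/data/event/navtiming/year=2020")
assert re.fullmatch("wmf/data/event/.+", "wmf/data/event/navtiming/year=2020")
```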
[22:30:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10Nuria) FYI that I reran this timer as it is now deployed on an-launcher1002 (with only the 'order' fix but not the subsequent fix) and it wor... [22:35:29] (03CR) 10Nuria: Fix directory expansion bug in refinery-drop-older-than (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [23:13:21] 10Analytics, 10Operations, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) [23:33:27] 10Analytics, 10Operations, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) My opinion on this request is that having non thoroughly supervised contributors accessing data introduces...