[00:49:57] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:56] <icinga-wm>	 PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:11:33] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10Nuria) @MMiller_WMF we missed this month deploy of this change, will it be oK to wait for the run of November 1st or you needed it sooner?
[06:05:03] <elukey>	 klausman: good morning :) I am deeply sorry for the stat100x, I forgot to check at the start the dhcp/pxe config, 1004/6/7 have been re-installed with stretch :(
[06:05:38] <elukey>	 and nobody really thought to check /etc/debian_version or similar after 1004 :(
[06:17:31] <joal>	  /o\
[06:17:43] * joal feels super sorry :S
[06:19:32] <elukey>	 we need to redo the work again
[06:21:05] <elukey>	 anyway, trying to re-run drop-el-unsanitized-events.service on launcher 
[06:21:49] <joal>	 ack - launching a manual sqoop for page_props and user_properties
[06:21:56] <icinga-wm>	 RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:25:14] <elukey>	 this is still running --^
[06:25:19] <elukey>	 now the weird things is
[06:25:20] <elukey>	 Oct 02 00:00:01 an-launcher1002 systemd[1]: Started Drop unsanitized EventLogging data from the event database after retention period..
[06:25:23] <elukey>	 Oct 02 00:00:01 an-launcher1002 kerberos-run-command[21654]: User analytics executes as user analytics the command
[06:25:28] <elukey>	 Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: ........................
[06:25:31] <elukey>	 Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: ----------------------------------------------------------------------
[06:25:33] <elukey>	 Oct 02 00:46:07 an-launcher1002 kerberos-run-command[21654]: Ran 24 tests in 0.020s
[06:25:39] <elukey>	 46 minutes?
[06:25:52] <elukey>	 (then it appears an error later on)
[06:26:24] <joal>	 hm
[06:28:04] <elukey>	 some patches were merged yesterday for the script https://phabricator.wikimedia.org/T263495
[06:28:11] <elukey>	 it is probably related
[06:28:26] <joal>	 I assume it is elukey 
[06:28:28] <elukey>	 the script ends up with hdfs ls listing too many files and failing
[06:29:32] <joal>	 elukey: IIRC the fix dpeloyed yesterday was to prevent this case preciselk
[06:31:17] <klausman>	 elukey: wait, what? When I reinstalled them, they went to stretch?
[06:31:30] <elukey>	 klausman: yes... 
[06:31:41] <klausman>	 Oh man. Well, at least it was good exercise :D
[06:31:42] <elukey>	 we didn't really check what os was installed
[06:32:03] <klausman>	 At least this time around, we can probably skip the backups?
[06:32:05] <elukey>	 I am really sorry, it didn't occur to me to triple check
[06:32:19] <klausman>	 I missed it as well, don't beat yourself up over it
[06:32:20] <elukey>	 in theory yes, the reimage seems pretty solid
[06:32:56] <klausman>	 We still have the 1007 backup, if we do the real deal with that one first, we can be sure the actual buster install is fine without risking too much
[06:33:50] <klausman>	 Should I send out a warning mail now about 1007 being reinstalled on Monday?
[06:33:51] <joal>	 !log Manually sqoop page_props and user_properties to unlock mediawiki-history-load oozie job
[06:33:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:36:08] <wikibugs>	 10Analytics, 10Analytics-Kanban: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10elukey) There are two alarms in icinga right now:  * stat1007 - search-drop-query-clicks.service  ` Oct 02 03:30:03 stat1007 kerberos-run-command[49936]: ------------...
[06:38:37] <elukey>	 klausman: yep, we could schedule the reimage for say tue
[06:41:40] <klausman>	 I'll send the mail today, schedule the reinstall for Tue morning UTC
[06:45:43] <wikibugs>	 10Analytics, 10Analytics-Kanban: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10elukey) It might be something off between journald and python logging, because I can see for my re-run:  ` 2020-10-02T06:18:43 INFO   Unit tests passed. 2020-10-02T06...
[06:49:46] <joal>	 elukey: The talk about migrating from hdfs-2.7 to hdfs-3.3 was very interesting yesterday
[06:49:54] <elukey>	 !log add an-worker110[0-2] to the hadoop cluster
[06:49:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:49:55] <joal>	 elukey: I took some note
[06:50:00] <elukey>	 ah nice!
[07:05:26] <icinga-wm>	 PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:05:30] <elukey>	 yep
[07:09:29] <elukey>	 all GPU nodes are in service, we'll need to reboot them to properly configure the GPU, but I'll do it next week :)
[07:14:20] <elukey>	 in the meantime, 2.65PB available on HDFS
[07:14:21] <elukey>	 :D
[07:14:50] <elukey>	 in theory with all 16 nodes in we should cross the 3PB mark
[07:14:52] <joal>	 New nodes for the win :)
[07:14:59] <elukey>	 but then I'll have to remove 16 for OOW
[07:15:05] <elukey>	 so the joy will not last :D
[07:15:28] * joal thinks of the apple talk where they mentioned having a 140PB cluster, and needed a second, and then a third
[07:16:12] <elukey>	 I am wondering what they are doing in terms of federation etc..
[07:16:32] <joal>	 interesting question
[07:17:00] <elukey>	 I mean they have multiple 140PB clusters, so I guess they have a storage team dedicated only to hdfs
[07:17:14] <elukey>	 probably with hadoop committers
[07:17:44] <joal>	 elukey: the fact that they mention having muliple clusters makes me feel they don't do federation (they'd have a single one?) - but I might awefully wrong
[07:19:41] <elukey>	 joal: maybe 140PB is some scalability threshold that they had, but having a single namenode for 140PB would be really challenging
[07:20:07] <elukey>	 even if they don't cross 300M files (I doubt it), there are the block reports from all datanodes, etc..
[07:22:21] <joal>	 elukey: indeed - this is smothing I have heard - the network aspect is not to be forgotten at those scales
[07:25:02] <elukey>	 I am reading https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/NodeLabel.html, really interesting
[07:25:22] <elukey>	 worth to open a task? It seems something that requires some review/testing
[07:25:27] <elukey>	 (for the GPU nodes I mean)
[07:25:52] <joal>	 elukey: this is the first approach
[07:26:17] <joal>	 elukey: it has problems, namely there is no control over multiple jobs trying to access GPUs
[07:26:41] <joal>	 But it allows having GPUs on hadoop jobs, so worth a try
[07:27:10] <elukey>	 ah yes yes
[07:27:22] <elukey>	 but otherwise  those gpus will sit there taking dust :D
[07:27:36] <joal>	 elukey: I support trying :)
[07:42:11] <elukey>	 after this 6+16-16 run of workers, we'll add 24 more
[07:42:23] <elukey>	 that will be +1.152PB 
[07:43:07] <elukey>	 good thing that on monday we'll put more RAM on the masters :D
[07:54:43] <joal>	 great :)
[07:55:11] <joal>	 elukey: let's also plan on the strategy to help reducing small files :)
[07:56:09] <elukey>	 joal: I thought we wanted ozone! :P
[07:56:41] <joal>	 elukey: For sure we want ozone - AND we want bigger files :)
[08:43:07] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) a:05RobH→03Cmjohnson @Cmjohnson an-worker1111 seems to be in the wrong rack: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia....
[08:52:04] <joal>	 elukey: one of the thing we should really not forget fron yesterday talk on HDFS: distcp is LONG, and discp doesn't work well with very large folders it's better to split them
[08:53:33] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10JAllemandou) >>! In T258047#6511170, @Nuria wrote: > @MMiller_WMF we missed this month deploy of this change, will it be oK to wait for the run of November 1st or you needed it...
[08:54:46] <elukey>	 joal: :(
[08:55:00] <joal>	 elukey: yeah - at least we know
[09:22:58] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Put 6 GPU-based Hadoop worker in service - https://phabricator.wikimedia.org/T255138 (10elukey) All nodes joined the cluster, now we only need to reboot them (one by one) to enable the GPUs (some settings need a reboot).  After this we'll need to...
[09:23:47] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh 16 nodes in the Hadoop Analytics cluster - https://phabricator.wikimedia.org/T255140 (10elukey)
[09:28:40] <wikibugs>	 10Analytics: Configure Yarn to be able to locate nodes with a GPU - https://phabricator.wikimedia.org/T264401 (10elukey)
[09:28:44] * elukey afk for a bit
[09:42:31] <wikibugs>	 10Analytics, 10Code-Health-Objective, 10Epic, 10Platform Engineering Roadmap, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Naike)
[09:55:36] <wikibugs>	 10Analytics: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10elukey)
[09:57:20] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey)
[09:59:44] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) an-worker1117 is fixed, it was preferring to PXE boot as opposed to boot from disk, so the loop was endless.
[10:00:13] <klausman>	 elukey: For T264408, what host would be the best to test a new version on?
[10:00:14] <stashbot>	 T264408: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408
[10:01:34] <elukey>	 klausman: good question, not sure.. we could use a drained hadoop worker, but the os is different from stat100x (even if kernels are similar)
[10:02:05] <klausman>	 ideal would be a machine that still sees GPU use, but is also not the most important one
[10:02:39] <klausman>	 It seems the GPU on 5 is barely used at all
[10:04:03] <elukey>	 recently yes since it is completely stuck :D
[10:04:22] <elukey>	 my main concern is that with dkms there might be the need of reboots
[10:04:42] <klausman>	 I was looking at the last 7 days, 0 use.
[10:05:08] <elukey>	 ack
[10:06:19] <klausman>	 The other thing that worries me is the ability to go back. I know that with apt, backdating packages is a royal pain
[10:08:27] <elukey>	 in theory it should be doable, we have separate apt component for each rocm release
[10:08:59] <elukey>	 so a rollback should be something close to purge/reinstall packages
[10:09:23] <klausman>	 Good point
[10:09:58] <klausman>	 So the plan is to add the latest rocm as a new component, add that to (say) 1005 via puppet, purge the old stuff, install the new stuff and see what falls over.
[10:10:38] <elukey>	 exactly
[10:11:05] <elukey>	 this is what I have been doing so far https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages
[10:24:59] <klausman>	 Made a patchset for adding 3.8 to reprepro
[10:30:53] <elukey>	 klausman: check what we do for 33, there are some other settings to add in the apt-repo
[10:31:10] <elukey>	 a couple of lines I mean, to allow to check/update packages etc..
[10:32:53] <elukey>	 see we talk about GPUs and miriam joins the chan :D
[10:34:22] <miriam>	 :D miriam is technically off today, but please elukey let me know if there is anything I can do to help!! Thanks for all the work :)
[10:35:49] <elukey>	 miriam: nothing to do don't worry :)
[10:36:42] <klausman>	 elukey: The docs you sent mention to add the block I have in the patchset, then run puppet and then add the dep stuff. I was not aware there was something else.
[10:39:42] <klausman>	 Oh, you mean modules/aptrepo/files/distributions-wikimedia
[10:40:57] <elukey>	 yep if it is not on the doc please add it
[10:41:16] <elukey>	 going afk for lunch, ttl!
[10:41:19] <klausman>	 So, hmm. Would we only add this to Buster for now and decide Stretch later (or skip it)? Or go the whole way now?
[10:41:38] <elukey>	 it depends where we want to test it :D
[10:41:47] <klausman>	 1005, I'd say
[10:42:17] <elukey>	 yeah but what if dkms causes the host to hang or if you need to reboot a couple of times due to some issue?
[10:42:23] <elukey>	 it will disrupt people working
[10:42:31] <elukey>	 this is my main concern
[10:42:48] <elukey>	 the only viable solution could be to set up a maintenance window, that could work
[10:42:49] <klausman>	 Well, I don't know of a place where that wouldn't be the case. And 1005 is hung anyway, it'll need a reboot soon either way
[10:43:00] <elukey>	 but with the usual two days of warning etc..
[10:43:16] <klausman>	 Yes, I could announce today and do the deed on Monday or Tuesday
[10:43:56] <elukey>	 I would do some research first, it might cause stuff like tensorflow compatibility to change etc..
[10:44:21] <klausman>	 Hrmm. Good point.
[10:44:22] <elukey>	 the last time it was tf 1 to tf2 so really impactful for users (and I had to wait), this time should be ok but let's double check first
[10:44:35] <klausman>	 I'll do some changelog reading
[10:44:36] <elukey>	 I was proposing the worker since once drained we can really have less constraints
[10:44:39] <elukey>	 ack
[10:44:39] <elukey>	 ttl!
[11:51:27] <fdans>	 morning!
[11:51:43] <joal>	 Hi fdans
[12:57:42] <klausman>	 yay tea delivery!
[13:00:19] <elukey>	 special kinds??
[13:00:33] <elukey>	 !log add an-worker110[6-9] to the Hadoop cluster
[13:00:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:03:35] <klausman>	  1kg of Earl grey and 150g from the only tea plantation in NZ
[13:04:01] <klausman>	 (https://zealong.com/)
[13:05:20] <elukey>	 wow
[13:06:24] <klausman>	 Yeah, not cheap, but we'll see if it's worth it
[13:07:17] * joal remembers the visit of tea plantations in Sri Lanka
[13:41:48] <fdans>	 elukey: joal whenever yall have a minute could you take a look at this puppet patch?
[13:41:48] <fdans>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/629409
[13:42:08] <joal>	 reading
[13:44:26] <joal>	 fdans: I'm sorry my memory on the pageview_complete topic is not accurate - This dataset is to replace pagecount-ez, right?
[13:44:42] <fdans>	 joal: that's correct :)
[13:45:31] <joal>	 Second memory backup fdans please: You have recomputed/reformatted all pagecount-ez to the new format (therefore the backfilling jobs), and this sync is for the whole dataset (including new when generated every hour)
[13:46:15] <fdans>	 joal: yes, this takes it from its location in hdfs to the dumps host
[13:46:21] <elukey>	 ah this is also interesting for me, if this is done then the oozie db increase should stop right?
[13:46:35] <joal>	 :)
[13:47:19] <joal>	 Thanks fdans for the reminder
[13:47:40] <elukey>	 mforns: o/ - let me know when you are around / have a moment
[13:48:19] <fdans>	 elukey: not exactly. Right now there's 2.5 years that have not yet been backfilled because I detected an inconsistency with the original dumps, so the backfilling is stopped until I solve that
[13:49:00] <elukey>	 ahh
[13:49:06] <elukey>	 so still stopped ok
[13:49:14] <fdans>	 elukey: this puppet patch has nothing to do with oozie, it is only to set up the rsync to the dumps hosts
[13:49:17] <elukey>	 can you tell me something when you restart?
[13:49:25] <fdans>	 elukey: yes for sure
[13:49:40] <elukey>	 fdans: yep yep I got it, I was only curious about the db increase :)
[13:50:09] <elukey>	 pcc looks good, we can merge if joal is ok
[13:50:12] <fdans>	 elukey: has it been increasing over the last couple days?
[13:50:25] <joal>	 fdans: The related oozie job for regular data generation is pageview-daily_dump - correct?
[13:50:40] <fdans>	 yes
[13:51:06] <joal>	 ok - I question having the job running hourly then
[13:51:09] <joal>	 fdans: --^
[13:51:52] <elukey>	 fdans: much slower pace than before - https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&refresh=5m&from=now-14d&to=now
[13:52:01] <mforns>	 hi elukey just joined
[13:52:15] <elukey>	 hola hola
[13:52:38] <fdans>	 joal: oh right
[13:53:07] <fdans>	 I set it running hourly thinking about the historical job, which is hourly, but this should be ran daily
[13:53:49] <joal>	 fdans: even if not that expensive as no data is being copied, hourly check for changes over a few thousand folders while we know no folder hass changed is not needed :)
[13:54:46] <joal>	 fdans: I advise running the sync job early human-morning (~5am) - the probability of new data being present would be higher
[13:55:24] <fdans>	 joal: yes that makes sense, will update CR shortly, thank you for the review
[13:56:00] <elukey>	 mforns: https://phabricator.wikimedia.org/T263495#6511231 - not urgent, but it may be related to yesterdays' changes
[13:56:19] <mforns>	 elukey: yes I saw that
[13:56:24] <mforns>	 it's weird!
[13:56:54] <mforns>	 I saw your comment about the ordering of the list in the test, but the code should return always the same order I think...
[13:57:40] <mforns>	 elukey: oh, actually, on a second thought, there's some partial ordering issues that could happen.
[13:57:48] <mforns>	 great, will fix that
[13:58:46] <elukey>	 mforns: ah yes it was only a very ignorant comment, not sure if it made sense or not :D
[13:58:57] <mforns>	 elukey: it did, actually!
[14:01:41] <joal>	 elukey, fdans - Dropping for kids - once the daily timing is fixed, it's good for me to go 
[14:02:05] <joal>	 fdans: maybe you should ping Brooke, as she is a reviewer but didn't comment? Just saying :)
[14:02:15] <joal>	 See you in ~2h fols
[14:02:35] <wikibugs>	 (03PS1) 10Mforns: Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495)
[14:06:17] <elukey>	 mforns: we can live test this on stat1007 --^
[14:06:32] <mforns>	 elukey: just tested in an-launcher
[14:06:53] <elukey>	 sure I mean if it fixes the current timer failed on 1007
[14:07:08] <mforns>	 it's a very minor change, the program was failing to pass the tests because an undeterminism in the hdfs mock
[14:07:16] <mforns>	 oh, ok
[14:07:36] <mforns>	 let me fetch the command
[14:13:05] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10CBogen) > Is there a plan to bring MediaSearch to other wikis in the future, or will it b...
[14:13:11] <fdans>	 team: I have a bunch of furniture to move so I'm going to be afk for a couple hours, will be back around 4pm UTC
[14:20:07] <mforns>	 elukey: OK yes the fix works, now search-drop-query-clicks works fine. I think we can merge the patch.
[14:20:15] <mforns>	 elukey: looking now into the other error.
[14:21:29] <elukey>	 mforns: lovely
[14:21:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[14:21:57] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix ordering issue in refinery-drop-older-than test [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631780 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[14:23:59] <elukey>	 !log live patch refinery-drop-older-than on stat1007 to unblock timer (patch https://gerrit.wikimedia.org/r/6317800)
[14:24:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:52:23] <elukey>	 mforns: if the bug is too difficult to narrow down we could think about a quick rollback + refinery deploy without hdfs, to make the script running
[14:54:58] <mforns>	 elukey: I found what is happening
[14:55:39] <wikibugs>	 (03PS1) 10Milimetric: [WIP] Refactor state for cleanliness and consistency [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/631791 (https://phabricator.wikimedia.org/T262725)
[14:56:45] <mforns>	 with the new code, the program might try to hdfs.ls too big of a tree, IF the path_format regular expression is wrong, OR the regular expression excludes a big enough portion of the tree...
[14:57:29] <mforns>	 that's the case of drop-el-unsanitized-events, where the regexp excludes all mediawiki_job data sets within the base_path tree
[14:58:13] <mforns>	 the script tries to recursive ls those subtrees to find matches in the regexp, but it can't, thus leading to ls the whole tree
[14:58:55] <mforns>	 it's a difficult problem
[15:00:29] <nuria>	 mforns: i am not sure i understand , the regex  there excludes mediawiki tables and what is the problem it causes?
[15:00:48] <nuria>	 mforns: wait i have a 1 on 1 , i can talk later
[15:00:54] <mforns>	 nuria: ok
[15:02:21] <wikibugs>	 (03Abandoned) 10Milimetric: [WIP] Clean up data flow as pertains to state [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/604387 (owner: 10Milimetric)
[15:35:37] <wikibugs>	 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) @Milimetric -- @bd808 was able to fix the code syntax highlighter on the blog's editor, and I applied this to the two main blocks on yo...
[15:49:38] <wikibugs>	 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10Milimetric) @srodlund looks awesome, thanks to you and Bryan :)
[15:52:38] <wikibugs>	 10Analytics, 10Platform Team Sprints Board (Sprint 5), 10Platform Team Workboards (Green): Ingest api-gateway.request events to turnillo - https://phabricator.wikimedia.org/T261002 (10WDoranWMF)
[15:53:09] <wikibugs>	 10Analytics, 10MediaWiki-REST-API, 10Patch-For-Review, 10Platform Team Sprints Board (Sprint 5), and 2 others: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (10WDoranWMF)
[15:54:20] <wikibugs>	 10Analytics, 10Event-Platform, 10EventStreams: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (10Jdlrobson)
[16:26:22] <wikibugs>	 (03PS1) 10Mforns: Fix directory expansion bug in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495)
[16:27:04] <mforns>	 elukey, nuria, razzi: ^
[16:27:46] <mforns>	 this should fix the thing. I added a (hacky) method to check for partial matches. Added unit tests, and tested with real data. Seems to work.
[16:31:39] <wikibugs>	 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) 05Open→03Resolved a:03srodlund Yay! Announced on Twitter and resolving this ticket! Thanks for all your work on this! It's a real...
[16:37:46] <joal>	 fdans: heya - currently writing an email about the mediawiki-history oozie job failed - I'm relaunching it from hue and will investigate as it is the second time happens
[16:39:16] <joal>	 fdans: I was monitoring it as it failed last month IIRC
[16:39:44] <elukey>	 mforns: very nice explanation in the commit msg
[16:40:20] <mforns>	 hehe elukey you read it to the end, you're my hero
[16:40:40] <elukey>	 well you are for explaing the problem in that detail :)
[16:41:10] <elukey>	 one thing that I want to understand: can you tell me a bit more about hdfs.ls() leading to thousand of results?
[16:41:30] <elukey>	 (I am not familiar with how it behaves in our code, I can RTFM myself in case :D)
[16:41:44] <joal>	 elukey: quick quexstion for you - how do I need to restart an oozie job with hue with prod user?
[16:42:25] <elukey>	 nono in theory it should be possible with hue-next, the error that Marcel was getting SHOULD be related to mysql connections exhausted
[16:42:39] <elukey>	 if you have a moment to check that I am not crazy I'd be grateful
[16:42:39] <joal>	 Ah ok - Trying
[16:42:55] <elukey>	 mforns: I meant "This makes it hdfs.ls() the whole tree, which can contain tens of
[16:42:58] <elukey>	 thousands of sub-paths"
[16:43:28] <joal>	 !log Rerun  mediawiki-history-denormalize-wf-2020-09 after failed instance
[16:43:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:44:04] <elukey>	 I am asking since I'd like to make sure that hdfs.ls() is doing the right thing, like not recursively showing an entire subtree
[16:44:20] <mforns>	 elukey: no no, hdfs.ls is fine
[16:44:52] <mforns>	 elukey: but the script was trying to hdfs.ls(lots and lots of directories)
[16:45:37] <mforns>	 elukey: i.e.: hdfs.ls([path1, path2, ... path80000])
[16:46:28] <elukey>	 ahhh all in once
[16:46:34] <mforns>	 yes
[16:46:49] <elukey>	 okok
[16:47:54] <elukey>	 mforns: is there a reason why it does ls with multiple paths? (again me ignorant sorry)
[16:48:19] <mforns>	 it's ls-ing the directory tree by depth level
[16:48:47] <mforns>	 elukey: it could do it one by one, but this way is more efficient I believe
[16:48:59] <elukey>	 ok so it was for performance
[16:49:01] <elukey>	 okok
[16:49:03] <mforns>	 yea
[16:59:09] <elukey>	 mforns: here I am sorry - ok so the fix is basically avoid adding dirs that matches the regex, while the script expands the subtrees
[16:59:12] <elukey>	 IIUC
[16:59:34] <elukey>	 so when hdfs.ls() runs it should be more manageable
[17:00:11] <mforns>	 elukey: yes, it avoids adding all the dirs that do *not* match the regex, so the hdfs.ls() is more manageable
[17:00:32] <elukey>	 ah yes yes sorry, not
[17:00:41] <mforns>	 from the dirs that do not match the full regex, it only will add those that match it partially
[17:00:54] <elukey>	 it seems a good fix for the moment, I am wondering what is the limit and if there is the risk of hitting it again in the future 
[17:01:06] <razzi>	 mforns: I second elukey +1'ing the descriptive commit message. IMO it's a good time to reexamine the use cases of this script and see if there's a way that it can do what it needs to while relying less on complex regex. As I understand, use cases that start with `--path-format='.+/ ...` will always pass the new `path_is_partial_match` check, so that is would only be a partial solution
[17:02:35] <mforns>	 razzi: that could be solved when adding new deletion jobs, do not use '.+/', instead use '[^/]+'
[17:03:18] <elukey>	 we could also explore a depth-first approach for dir traversal, and see if python+asyncio/gevent/etc.. could lead to acceptable perfs
[17:03:30] <elukey>	 (most of the time IIUC the script waits for I/O from hdfs or hive)
[17:06:33] <elukey>	 mforns: we can go ahead with this fix in my opinion, and maybe brainstorm about a long term solution? What do you think? 
[17:06:35] <mforns>	 elukey: you mean using depth-first instead of the "threshold"-pruning?
[17:07:19] <elukey>	 mforns: yep I mean adding a boundary on the maximum number of paths to ls() in one go
[17:07:25] <elukey>	 if even possible 
[17:07:58] <mforns>	 elukey: I don't see performance problems after this fix. It is true, as razzi says, that the regexp parameter is complicated, but it was like that since we created this script
[17:17:57] <wikibugs>	 10Analytics, 10Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10herron) p:05Triage→03Medium
[17:35:36] <elukey>	   mforns yep yep I agree that the fix is good, what I was wondering is if in the future, maybe with more complex subtrees, we could end up in a situation in which even if with the regex the script hits the max arg list
[17:35:54] <elukey>	 but we can tackle the problem if it presents itself in the future
[17:36:19] <elukey>	 (I don't want to nitpick just reason out loud with ideas, don't feel blocked)
[17:36:48] <mforns>	 elukey: if we use regexes carefully, like the example razzi and I were discussing (use '[^/]+/' instead of '.+/'), I think this should not give more issues in the future
[17:38:00] <mforns>	 however, I agree with razzi, that the regexp argument is not ideal, and we can think of solving this in the future
[17:38:20] <elukey>	 yep you two can follow up on what's best :)
[17:39:49] <mforns>	 yea, sounds goo
[17:39:53] <mforns>	 *good :]
[17:41:03] <joal>	 dumb idea mforns, elukey, razzi - Would rewriting the algo in jvm world make our life easier through having better HDFS api?
[17:41:28] * joal run and hides into jvm world
[17:41:34] <mforns>	 heheheh
[17:43:30] <mforns>	 I think the problem that we have now is the difficult regular expression argument, which is probably unrelated to language. But yea, we can consider moving to Scala? not Java please!
[17:43:34] <mforns>	 :P
[17:44:03] <joal>	 You have my entire support for scala-not-java mforns :)
[17:47:34] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, 10Patch-For-Review: MEP Client MediaWiki PHP - https://phabricator.wikimedia.org/T253121 (10Mholloway) Per discussion in the team meeting earlier this week, this can live on the PID workboard only rather than also PI core...
[18:01:15] <nuria>	 mforns, joal : +1 to mforns (and razzi) that issue is what arguments  and the way the tree is traversed not language
[18:01:32] <joal>	 works for me :)
[18:06:39] * elukey afk!
[18:10:38] <wikibugs>	 (03CR) 10Razzi: Fix directory expansion bug in refinery-drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[18:20:28] <wikibugs>	 (03PS2) 10Mforns: Fix directory expansion bug in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495)
[18:20:46] <wikibugs>	 (03CR) 10Mforns: Fix directory expansion bug in refinery-drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[18:23:58] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10MMiller_WMF) Oh okay, great!  @Miriam -- could you please check to see that the data you need is there?  Then we can resolve the task.
[21:21:36] <nuria>	 mforns: ok, back on this i understand the  issue with dropping script now
[21:43:44] <wikibugs>	 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Dzahn) The hosts here are showing up in a weird state. When running the DNS cookbook you get warnings that these hosts exist but are not "in devices...
[22:30:29] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10Nuria) FYI, that i reruned this timer as it is now deployed on an-launcher1002 (with only the 'order' fix but not the subsequent fix) and it wor...
[22:35:29] <wikibugs>	 (03CR) 10Nuria: Fix directory expansion bug in refinery-drop-older-than (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[23:13:21] <wikibugs>	 10Analytics, 10Operations, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila)
[23:33:27] <wikibugs>	 10Analytics, 10Operations, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) My opinion on this request is that having non throughly supervised contributors accessing data introduces...