[07:10:04] hello team :]
[07:14:13] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Summary of the current state and results achieved: * We added...
[07:16:31] mforns: o/
[07:17:19] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:18:43] ran it manually on an-coord --^
[07:30:17] joal, yt?
[07:34:37] joal, I've prepared the run of the mediawiki history dumps, but I don't want to run it without you here.
[07:34:37] The code now points to the already re-partitioned data from the last run in /tmp
[07:34:37] so it won't execute the spark job again, it will just go over the oozie loop
[07:34:37] this time though, the loop is sequential (parallel=false)
[07:34:37] also, alarms are muted
[07:34:58] I need to leave now though...
[07:35:21] If you want to execute the thing, here's the code to start the coordinator: https://pastebin.com/SD8LB71r
[07:35:34] But I can do that when I'm back later, no problemo!
[07:36:59] oh, btw, it needs to be executed from stat1007
[07:37:30] anyway, will see you guys before standup! byeeee
[08:04:44] Hi mforns - I'm sorry I started late, forgetting we had this appointment :( Please excuse me :(
[08:05:24] Good morning team
[08:06:36] !log Launch mediawiki-history-dump test from marcel folder
[08:06:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:08:15] elukey: o/ - Do we make the hive-site.xml patch happen?
[08:11:28] joal: o/ have you already reviewed the patch?
[08:12:08] elukey: I had not noticed it - reading
[08:17:04] elukey: code looks great - maybe make the commit message a bit more precise?
[08:17:13] elukey: if not necessary, all good :)
[08:17:30] sure
[08:20:19] joal: done, hope it is better
[08:20:55] indeed !
[08:20:58] Thanks for that :)
[08:22:25] ok - Marcel's job works a lot better with sequential actions - It would be interesting to test with parallel at year level, but this is premature optimization
[08:22:36] joal: do we want to deploy it today?
[08:23:17] Something else I learned from monitoring that job: loops in oozie are done in a way that makes previous steps in the loop keep their workflows open while the loop is not finished
[08:23:22] elukey: I don't think so
[08:23:31] joal: also, something that I learnt today - Buster offers only java 11 :D
[08:23:38] elukey: :(
[08:23:41] pffff
[08:24:08] I thought java8 was an option (from what I recall having read in a task from you or andrew)
[08:24:31] elukey: about the deploy - you were talking about hive, right?
[08:24:41] yeah I thought so too, but Moritz told me that it was a pre-release version of buster that included java8 - now it is gone
[08:24:47] joal: yeah, hive
[08:24:47] elukey: I was thinking of marcel's patch and did not switch context :)
[08:25:04] As you want - We can either deploy it today or wait
[08:25:26] it will require a restart of the hive server, but it is doable
[08:25:29] elukey: I should be able to devise a test query
[08:26:21] elukey: If ok for you, let me devise that test, make sure the system behaves as expected (expected success and failures depending on conf) and then we can deploy?
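A note on the coordinator-start step mforns mentions above: the actual command lives in the linked pastebin and is not reproduced in the log. Purely as a hedged sketch of what launching a refinery oozie coordinator from stat1007 typically looks like, the snippet below wraps the standard oozie CLI in Python; the server URL and the properties-file path are assumptions, not the contents of the paste.

    # Hypothetical sketch of starting an oozie coordinator from stat1007.
    # The real command is in the pastebin linked above; the oozie server URL
    # and the properties file path here are illustrative assumptions only.
    import subprocess

    cmd = [
        "oozie", "job",
        "-oozie", "http://an-coord1001.eqiad.wmnet:11000/oozie",  # assumed server URL
        "-config", "coordinator.properties",                      # assumed local properties file
        "-run",
    ]

    # Launch and print oozie's reply (normally "job: <coordinator-id>").
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)

The -oozie/-config/-run flags are the standard oozie CLI interface; everything job-specific would come from the properties file prepared for the test run.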
[08:27:01] +1
[08:53:12] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Hive query fails with local join - https://phabricator.wikimedia.org/T209536 (JAllemandou) After @elukey found the reason for which local-tasks were failing (thanks for the awesome troubleshooting!), I devised this to be able to...
[08:53:17] elukey: --^ last comment
[08:57:56] looks good to me!
[08:58:04] joal: ok to merge?
[08:58:13] Let's go~
[09:03:37] mediawiki-history-dumps finished successfully without any trouble - Good :)
[09:03:45] mforns: --^ for when you're back :)
[09:08:12] joal: done, we need to restart the hive server when possible :)
[09:09:02] elukey: monitoring the tasks - will ping when I think we're good
[09:15:35] Analytics, ChangeProp, Discovery-Search, EventBus, and 3 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (mobrovac) >>! In T230730#5432125, @Gehel wrote: >>>! In T230730#5422294, @mobrovac wrote: >> There already is a mechanism in change propagation...
[09:31:25] Analytics: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) p: Triage→Normal
[09:31:42] joal: --^
[09:33:51] elukey: the way I understand the last lines is that we'd rather go with java11 - correct?
[09:35:27] in theory it would be preferable, in practice it is a big mess in my opinion
[09:36:25] elukey: I have an idea: Do it all at once (java11, cdh 6.3, spark 2.4) - Upgrade, and then go on holiday :)
[09:36:51] unrelated elukey - Good to go with the hiveserver restart IMO
[09:37:06] elukey: no hive job currently running
[09:38:20] !log restart hive-server2 on an-coord1001 to pick up new settings - T209536
[09:38:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:23] T209536: Hive query fails with local join - https://phabricator.wikimedia.org/T209536
[09:38:38] joal: done :)
[09:38:43] elukey: testing !
[09:45:01] elukey: confirmed working :)
[09:45:07] Analytics, Analytics-Kanban, Product-Analytics: Hive query fails with local join - https://phabricator.wikimedia.org/T209536 (JAllemandou) Test after update using > ` > SELECT h.hostname, u.nb_users > FROM ( > SELECT wiki_db, COUNT(distinct user_id) as nb_users > FROM wmf.mediawiki_user_history...
[09:45:11] super
[09:48:09] going afk for a bit!
[10:09:07] Analytics, Product-Analytics, Reading Depth, Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (phuedx) Per the above, [[ https://meta.wikimedia.org/w/index.php?title=Schema_talk%3AReadingDepth&t...
[10:43:14] Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (JAllemandou) Some thoughts before taking action: - Datasets are monthly, so when we delete a full month there is at most 31 days discrepancy between the first and last d...
[10:43:39] Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (JAllemandou) @mforns, @Nuria - Comments? --^
[10:48:10] * mforns back
[10:48:45] Hi mforns! As I mentioned earlier (you were already gone), please excuse me for having started late - I forgot about our appointment :(
[10:48:50] hey joal, thanks for executing the job and following up, cool! It works. All right, then I'll leave it as is until we know more.
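The phab comment above quotes joal's verification query only in truncated form, and the log never names the hive-site.xml setting that was changed. Purely as an illustration of the kind of "expected success and failures depending on conf" test being described, the sketch below runs the same join under both values of hive.auto.convert.join (the standard switch for the map-join local-task code path) via beeline; the JDBC URL, the table names, and the choice of that particular setting are all assumptions, not facts from T209536.

    # Hedged sketch: exercise a join with the local-task (map-side join) code
    # path toggled on and off, checking which configuration succeeds.
    # The JDBC URL, tables, and the tested setting are assumptions; the real
    # test query is in T209536 (truncated in the log above).
    import subprocess

    JDBC_URL = "jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default"  # assumed
    QUERY = """
    SELECT a.k, b.cnt
    FROM some_small_table a
    JOIN (SELECT k, COUNT(*) AS cnt FROM some_big_table GROUP BY k) b
    ON a.k = b.k
    """  # hypothetical stand-in with the same shape as the quoted test query

    for auto_convert in ("true", "false"):
        result = subprocess.run(
            ["beeline", "-u", JDBC_URL,
             "--hiveconf", "hive.auto.convert.join=" + auto_convert,
             "-e", QUERY],
            capture_output=True, text=True)
        # A non-zero exit with auto-convert enabled would reproduce the
        # failing local-task behavior described in the task.
        print("hive.auto.convert.join=%s -> exit %d"
              % (auto_convert, result.returncode))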
[10:49:06] * joal is feeling a bit ashamed
[10:49:12] joal, no problem at all! why?
[10:49:24] I hate to set up an appointment and then be the one missing it
[10:49:40] mforns: the good thing is indeed, it works
[10:49:43] we worked together anyway, even if async
[10:49:54] great
[10:50:16] mforns: other interesting thing to notice - Every step of the loop stays live while the loop is not finished (live = has a started workflow step)
[10:51:03] So at the end of the years loop, there are 30 workflows done at 80%, waiting to be unpiled once the last one is done
[10:51:42] Other thought: we might want to parallelize the years and keep the wikis sequential
[10:51:52] mforns: --^
[10:51:55] joal, yes, 30 is the number I calculated
[10:52:14] a depth of 30 seems totally fine
[10:52:19] mforns: this is when we go sequential - when going parallel it can be a lot more
[10:52:34] but the tree size has probably diminished significantly
[10:53:01] It's not only about depth, it's about started workflows in oozie :)
[10:53:04] joal, I think the parallel version is also depth=30
[10:53:10] exactly
[10:53:16] the problem was never the depth
[10:53:21] rather the size of the tree
[10:53:32] mforns: what is named depth is actually tree size
[10:53:37] you're right
[10:53:39] with 19 years and 5 wiki_groups, the size of the double loop was 301 workflows
[10:53:47] depth is the number of co-running workflows
[10:53:59] too many :)
[10:54:36] mmmm, in the oozie documentation, they say...
[10:54:39] also mforns, every loop step has multiple related workflows, so going parallel might actually mean more than 301 :)
[10:54:57] no no, I calculated the related workflows as well
[10:55:03] I have a formula xD
[10:55:08] huhu :)
[10:55:10] ok !
[10:55:19] just for this particular job, of course, not generic
[10:55:24] it is:
[10:55:50] tree depth = cardinality(wiki_groups) + cardinality(years) + 6
[10:56:32] tree size = (3 * cardinality(wiki_groups) + 3) * cardinality(years) + 1
[10:57:31] * elukey lunch!
[10:57:52] my understanding is that when executed in sequence, the max tree depth is the same
[10:58:04] but the tree size should be way smaller at all times
[10:58:20] mforns: maybe not at all times
[10:58:47] mforns: Oh no, you're right
[10:59:13] mforns: even with parallel years, there will still be at least nb-years workflows
[10:59:17] Ok - great :)
[10:59:22] Let's keep it this way
[10:59:23] :)
[10:59:34] need to step afk for a few minutes
[10:59:37] heh, I haven't checked, but that was my hypothesis
[10:59:38] bbiab
[10:59:40] byee
[10:59:59] ok
[11:08:14] back :)
[11:52:21] Analytics, Product-Analytics: Ensure Wikitech page about custom jupyter notebooks exists and is up to date - https://phabricator.wikimedia.org/T230742 (JAllemandou) Done here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Custom_virtual_environment Ping @Neil_P._Quinn_WMF - Hopefu...
[11:52:26] Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (mforns) @JAllemandou > - IIRC the retention policy is about keeping AT MOST 90 days, so I'd rather keep 65, making sure we always have 2 months of data when the geoedito...
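Since the two counting formulas quoted in the conversation above are easy to mis-multiply, here is a minimal sketch that simply evaluates them as written. With 5 wiki_groups and 19 years the depth formula reproduces the 30 co-running workflows both participants arrive at; note that the size formula as quoted evaluates to 343 rather than the 301 mentioned earlier in the chat, so one of the two figures in the log presumably carries a typo.

    # Evaluate the oozie double-loop counting formulas exactly as quoted above.

    def tree_depth(n_wiki_groups, n_years):
        # depth = number of co-running (started) workflows
        return n_wiki_groups + n_years + 6

    def tree_size(n_wiki_groups, n_years):
        # size = total workflows created by the wiki_groups x years double loop
        return (3 * n_wiki_groups + 3) * n_years + 1

    print(tree_depth(5, 19))  # 30, matching the depth discussed in the chat
    print(tree_size(5, 19))   # 343 per the formula as quoted (the log says 301)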
[11:52:34] Analytics, Analytics-Kanban, Product-Analytics: Ensure Wikitech page about custom jupyter notebooks exists and is up to date - https://phabricator.wikimedia.org/T230742 (JAllemandou)
[11:58:11] Analytics, Analytics-Kanban, Product-Analytics: Apply hive2-server fix to command line - https://phabricator.wikimedia.org/T230741 (JAllemandou) a: elukey
[11:58:53] Analytics, Analytics-Kanban, Product-Analytics: Ensure Wikitech page about custom jupyter notebooks exists and is up to date - https://phabricator.wikimedia.org/T230742 (JAllemandou) p: Normal→High
[12:00:30] elukey: I took the liberty of assigning/moving the task about hive-server
[12:00:36] just letting you know
[12:04:53] sure :)
[12:14:59] elukey: if you don't mind proofreading https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie/Administration#Hard_killing_a_workflow
[12:16:55] Analytics, Analytics-Kanban, Product-Analytics: Apply hive2-server fix to command line - https://phabricator.wikimedia.org/T230741 (elukey) This happened with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/531866/
[12:19:43] joal: corrected a few typos, all good for me. QQ - why should the ids be less than 10?
[12:19:48] (curiosity)
[12:20:10] elukey: number of actions managed by a workflow id
[12:20:21] elukey: could theoretically be more, but I've never seen it
[12:20:53] joal: ah yes, maybe add a comment about that, because one could wonder why that number etc..
[12:20:57] but the rest looks good!
[12:21:07] ack!
[12:21:09] (say for some reason it is 12, etc..)
[12:21:10] thanks :)
[12:21:49] joal: another thing - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Custom_virtual_environment
[12:21:55] yes
[12:21:55] we can use python3.7 in theory
[12:22:04] elukey: not yet :)
[12:22:05] it is something that Andrew was working on recently
[12:22:16] IIRC the package has not been deployed
[12:22:30] ah ok, yes, then we'll update the page after that, okok
[12:22:31] good
[12:22:55] today I found something super interesting: https://meta.wikimedia.org/wiki/Special:UrlShortener
[12:24:14] !log Rerunning refine for eventlogging-analytics for 2019-08-23T03:00
[12:24:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:27:09] taking a break team - see you in a bit
[12:50:49] Analytics, Product-Analytics, Reading Depth, Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (Nuria) @Groceryheist with our very limited resources (more so this year than in years past) we real...
[13:24:37] Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (Nuria) >Shouldn't we execute the script every day? It'd be a no-op on most days, and delete data when we go over the stated period. +1 seems a lot less error prone
[14:19:20] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (elukey) For reference, awesome testing session done by Joseph: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_clu...
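On the "execute the script every day" proposal for the geoeditors_private retention that Nuria +1's above: the point is that a daily, idempotent check is less error-prone than deleting on a fixed scheduled day. Below is a minimal sketch of that idea only, assuming monthly partitions named YYYY-MM and the 65-day retention floated in T231017; the partition layout and the deletion hook are assumptions, not the actual refinery deletion script.

    # Hedged sketch of an idempotent daily retention check: a no-op on most
    # days, selecting partitions for deletion only once they age out of the
    # retention window. Partition naming (YYYY-MM) is an assumption.
    from datetime import date, timedelta

    RETENTION_DAYS = 65  # value floated in T231017

    def partitions_to_drop(partitions, today=None):
        """Return the monthly partitions whose *last* day is past retention."""
        today = today or date.today()
        cutoff = today - timedelta(days=RETENTION_DAYS)
        drop = []
        for p in partitions:  # e.g. "2019-05"
            year, month = map(int, p.split("-"))
            # last day of the partition's month
            first_of_next = date(year + month // 12, month % 12 + 1, 1)
            last_day = first_of_next - timedelta(days=1)
            if last_day < cutoff:
                drop.append(p)
        return drop  # empty on most days => the daily run is a no-op

    # Example run (e.g. from a daily systemd timer); actually dropping the
    # Hive partitions / HDFS dirs is left to the real deletion script.
    print(partitions_to_drop(["2019-04", "2019-05", "2019-06", "2019-07"],
                             today=date(2019, 8, 26)))  # -> ['2019-04', '2019-05']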
[14:36:18] (PS1) Gilles: Retain PaintTiming + remove defunct NavigationTiming fields [analytics/refinery] - https://gerrit.wikimedia.org/r/531935 (https://phabricator.wikimedia.org/T231087)
[14:38:59] (CR) Nuria: [C: +1] Retain PaintTiming + remove defunct NavigationTiming fields [analytics/refinery] - https://gerrit.wikimedia.org/r/531935 (https://phabricator.wikimedia.org/T231087) (owner: Gilles)
[14:39:12] (CR) Nuria: [C: +2] Retain PaintTiming + remove defunct NavigationTiming fields [analytics/refinery] - https://gerrit.wikimedia.org/r/531935 (https://phabricator.wikimedia.org/T231087) (owner: Gilles)
[15:06:40] (CR) Gilles: [V: +2] Retain PaintTiming + remove defunct NavigationTiming fields [analytics/refinery] - https://gerrit.wikimedia.org/r/531935 (https://phabricator.wikimedia.org/T231087) (owner: Gilles)
[15:11:31] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (ArielGlenn) @elukey On our previous server we let people pull from us and it was very difficult to manage upgrades or any sort of maintenan...
[15:14:29] Analytics: Add json linting test for schemas in mediawiki/event-schemas - https://phabricator.wikimedia.org/T124319 (hashar)
[15:51:50] Analytics-Kanban, Product-Analytics, Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (mforns) a: Milimetric→mforns
[16:27:38] Analytics-Kanban, Better Use Of Data, Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (Nuria)
[16:27:40] Analytics-Kanban, Patch-For-Review: Upgrade superset to 0.34 - https://phabricator.wikimedia.org/T230416 (Nuria) Open→Resolved
[16:27:53] Analytics, Analytics-Kanban: Upgrade Turnilo to its latest upstream - https://phabricator.wikimedia.org/T230709 (Nuria) Open→Resolved
[17:01:14] Analytics, Analytics-Kanban, Wikimedia-Portals: Review all the oozie coordinators/bundles in Refinery to add alerting when missing - https://phabricator.wikimedia.org/T228747 (Nuria) This is a bit of an opsy task as well, but I think it is something that @JAllemandou can help with
[17:01:30] Analytics, Analytics-Kanban, Wikimedia-Portals: Review all the oozie coordinators/bundles in Refinery to add alerting when missing - https://phabricator.wikimedia.org/T228747 (Nuria) a: JAllemandou
[19:05:07] Analytics: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (cchen)