[08:11:17] o/
[08:11:28] bonjour!
[08:11:34] how are you elukey ?
[08:12:54] good :)
[09:05:23] (CR) Elukey: "Left a comment for utils.py, just a nit. The code in there looks good!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/553160 (https://phabricator.wikimedia.org/T239127) (owner: Joal)
[09:06:02] really a nit Joseph, the code in utils.py looks good
[09:06:56] ideally, now that we are on py3 only, we could start adding type annotations to the code to make it clearer
[09:06:57] elukey: no big deal - I wondered about doing so
[09:07:07] and refactor if necessary where too ambiguous
[09:07:08] (using lists only)
[09:07:19] I don't know about type annotations
[09:07:20] I find it clearer, but it is really an option
[09:07:33] I'll make a list, simpler
[09:07:51] ah, you can add type indications to the signature of a function; they are not mandatory for the compiler/interpreter but very useful
[09:08:29] and also https://docs.python.org/3/library/typing.html
[09:09:19] makes sense elukey - types are useful IMO
[09:11:20] (PS14) Joal: Add new tables and features to sqoop script [analytics/refinery] - https://gerrit.wikimedia.org/r/553160 (https://phabricator.wikimedia.org/T239127)
[09:11:36] (CR) Joal: Add new tables and features to sqoop script (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/553160 (https://phabricator.wikimedia.org/T239127) (owner: Joal)
[09:12:09] Thanks for the review elukey
[09:12:16] (PS5) Joal: Add newly sqooped tables to hive and mw-load job [analytics/refinery] - https://gerrit.wikimedia.org/r/553405 (https://phabricator.wikimedia.org/T239127)
[09:24:17] o/ hi joal!
[09:24:32] addshore, hello!
[09:24:32] how easy is it to do a one-off sqoop of mysql tables into hadoop to do some work / checking of things?
[09:24:48] addshore: which table are you after?
[09:25:00] wb_terms for wikidata, and then all tables prefixed with wbt_
[09:26:08] basically we need to do some comparison between them and make sure there are no holes in the new tables (as we think there has been an issue during migration)
[09:26:10] addshore: I'm not aware of us having them yet (even as a not-prod dataset)
[09:26:17] and doing that in SQL would take a long old while
[09:26:31] addshore: sqoop is not difficult and would be useful :)
[09:26:53] I forget how many rows wb_terms has - a few billion :P
[09:27:01] nice
[09:27:22] * addshore has no idea how to use sqoop
[09:27:27] addshore: the list of tables and their schema will be needed
[09:27:45] "and their schema", as in mysql schema? will that do?
[09:27:49] addshore: I also suggest providing a CR for adding those tables in our sqoop
[09:28:07] well, I don't think we want this to be a permanent thing, that's the thing
[09:28:11] script - This would allow for easier sqooping next time (and even productionization eventually)
[09:28:53] the wbt_ tables could be useful, actually (they are the secondary index of all terms, so labels, descriptions and aliases)
[09:29:13] addshore: let's make it right from the beginning?
[09:29:28] addshore: very easy to update the script (I'll explain)
[09:30:24] addshore: from https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L215 and after you can see how we describe queries from tables
[09:31:02] addshore: If you provide the main query for each table you need, I'll make the CR and integration with the script :)
[09:31:15] addshore: is that reasonable?
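The type annotations elukey suggests above look like this in practice; a minimal sketch of a py3-annotated helper. The function name and its fields are invented for illustration, not taken from the actual utils.py.

```python
# Minimal sketch of the py3 type annotations discussed above. The helper
# and its shape are hypothetical, not the real refinery utils.py code.
from typing import Dict, List, Optional


def tables_in_group(groups: Dict[str, List[str]], group: str) -> Optional[List[str]]:
    """Return the table names registered under `group`, or None if unknown.

    Annotations are ignored by the interpreter at runtime, but they
    document intent and let tools like mypy flag mismatched calls.
    """
    return groups.get(group)


# mypy would reject tables_in_group(["a"], "b") at check time; plain
# Python would only fail once the bad value is actually used.
print(tables_in_group({"wikibase": ["wb_terms", "wbt_item_terms"]}, "wikibase"))
```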
[09:31:20] Okay, I can do that I think :)
[09:39:12] joal: is split-by something to do with how it ends up being partitioned?
[09:40:51] addshore: split-by is the key by which data-gathering is partitioned - It should be the key used in boundary-query, which gathers boundary values for that key
[09:40:56] addshore: makes sense?
[09:41:16] okay! yup
[09:42:03] Analytics, Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (Miriam) >>! In T237605#5700591, @elukey wrote: >>>! In T237605#5700493, @Miriam wrote: >> Hi! Can I have access too please? My username is mirrys > > ` > elukey@krb1001:~$ sudo manage_principal...
[09:42:30] hmm, will doing joins on the sqooped data end up working?
[09:44:57] (PS1) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698
[09:45:08] ^^ queries for all of the tables
[09:45:42] addshore: joins in spark work, yes :)
[09:45:52] good good :D
[09:46:56] addshore: do you have a minute to hangout to discuss your patch? (will be quicker)
[09:47:13] on a train currently, so not right now, but maybe in an hour!
[09:47:28] it can wait until then :)
[09:47:29] addshore: will ask in the chan then :)
[09:47:37] okay!
[09:47:51] addshore: are those tables all big?
[09:48:12] no, wb_terms is very big, the others are all quite a lot smaller, both narrower and shorter
[09:48:59] wbt_item_terms is quite long / tall too actually
[09:49:10] I can get some concrete-ish numbers if that would help :)
[09:49:21] nearly at my train stop, so might have to pause
[09:49:43] np addshore, ping me when you are back online
[10:12:47] so I am trying to figure out why git fat is being used for refinery where we explicitly say no
[10:13:46] WAT?
[10:14:10] well I have removed refinery from notebook1003 and re-deployed
[10:14:11] elukey@notebook1003:/srv/deployment/analytics/refinery$ du -hs .git/fat/objects/
[10:14:14] 4.4G .git/fat/objects/
[10:14:51] elukey: wouldn't that be leftover from a previous deploy?
[10:15:15] joal: I removed everything before re-deploying from scratch
[10:15:23] Ah!
[10:15:28] ok - got snapped :)
[10:16:16] np, trying to reason out loud
[10:16:17] :)
[10:16:32] it may happen that our current scap config is not correct
[10:17:27] elukey: this seems a very plausible explanation
[10:20:26] in theory it seems good, because we define a specific environment to use
[10:36:10] joal: ah! I found something
[10:36:31] I removed the git fat config from the main scap.cfg
[10:36:46] and then deployed with
[10:36:46] scap deploy --environment thin --limit notebook1003.eqiad.wmnet --no-log-message --verbose --force
[10:36:52] no git fat binaries deployed
[10:37:38] hm
[10:37:52] I think I don't understand what the problem is :(
[10:39:08] I do now, sending a code review
[10:39:14] it will be clear
[10:41:28] elukey: I also have found the problem of the GeoCode hive UDF
[10:41:35] (PS1) Elukey: Override scap config for git_binary_manager in thin env [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/553702
[10:41:58] joal: ah nice!
[10:42:00] elukey: It is very understandable once found, but tricky to diagnose :)
[10:42:00] what was it?
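To make joal's split-by / boundary-query explanation above concrete, here is a simplified, hypothetical stand-in for how a table entry in refinery's sqoop.py (linked earlier) might be described. The dict shape is invented, and the wbt_item_terms column names follow the Wikibase term-store schema but should be treated as illustrative; the real script is structured differently.

```python
# Hypothetical stand-in for a refinery sqoop.py table description; it
# only illustrates how split-by and boundary-query relate. $CONDITIONS
# is sqoop's placeholder for the per-mapper range predicate.
queries = {
    'wbt_item_terms': {
        # Column sqoop partitions the data pull by, across mappers.
        'split-by': 'wbit_id',
        # Fetches MIN/MAX of that same column, so sqoop can slice the
        # id space into even ranges, one per mapper.
        'boundary-query': 'SELECT MIN(wbit_id), MAX(wbit_id) FROM wbt_item_terms',
        'query': ('SELECT wbit_id, wbit_item_id, wbit_term_in_lang_id '
                  'FROM wbt_item_terms WHERE $CONDITIONS'),
    },
}
```

The point from the exchange above is that split-by and boundary-query must agree on the column: the min/max returned by the boundary query define the per-mapper ranges that the split is made over.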
[10:42:02] (PS2) Elukey: Override scap config for git_binary_manager in thin env [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/553702
[10:42:09] this is the scap issue --^
[10:42:23] elukey: various things to notice - A check was done in the code using
[10:42:41] assert - And from what I read, assertions are disabled by default (so no check)
[10:43:23] then: we configure the database-reader based on parameters in the jobConfig (configure method) - But, for local small jobs, no jobConfig because no job!!
[10:43:35] So no configure, therefore no init
[10:44:36] very sneaky
[10:44:49] (CR) Elukey: [V: +2 C: +2] Override scap config for git_binary_manager in thin env [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/553702 (owner: Elukey)
[10:45:50] elukey: Ahhh - Incorrect configuration setting :)
[10:48:40] joal: removed refinery from notebooks and labstores
[10:48:43] then re-deployed thin
[10:48:44] elukey@labstore1007:/srv/deployment/analytics/refinery$ du -hs
[10:48:48] 38M .
[10:48:50] \o/
[10:49:03] * joal bows to elukey's perseverance
[10:49:56] joal: same thing for you for the geoip stuff!
[10:53:21] elukey: I'll be AFK for ~1h - see you then :)
[10:59:42] o/
[11:03:20] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts - https://phabricator.wikimedia.org/T234229 (elukey) I have fixed the refinery "thin" deploy configuration with https://gerrit.wiki...
[11:33:19] * elukey lunch!
[11:44:25] o/ joal, back to continue our earlier thingy when you are back
[11:44:31] I also have a question about eventlogging :D
[12:01:31] Hey addshore
[12:01:45] I'm probably not the best one to answer EL questions
[12:02:44] hehe, I'll ask it anyway just in case!
[12:03:55] Does the /beacon URL that is hit by frontend code for eventlogging etc go anywhere? Or is that just the method of getting the data into the webrequest logs, with EL then taking the data from the logs? Or does the beacon actually process stuff and pass data to some event logging service / pipe?
[12:04:21] as for the other stuff, shall we have a call? let me just grab a cup of tea!
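For context on the merged CR above: scap3 layers environment-specific config from scap/environments/<name>/scap.cfg on top of the repo's main scap.cfg, which matches the `--environment thin` deploy command elukey used. A sketch of the shape of the fix; the exact keys and values in the real change may differ.

```ini
# scap.cfg at the repo root: the full deploy pulls large binaries.
[global]
git_repo: analytics/refinery
git_binary_manager: git-fat

# scap/environments/thin/scap.cfg: hypothetical override for the "thin"
# environment used on notebooks/labstores. Environment files only layer
# on top of the global config, so git_binary_manager has to be
# explicitly overridden here, or the git-fat setting leaks through
# (the 4.4G of .git/fat/objects seen above).
[global]
git_binary_manager: None
```

And joal's GeoCode UDF diagnosis, restated as code. The real UDF is Java; this Python sketch, with entirely made-up names, only mirrors the init-order trap he describes.

```python
# Python sketch of the init-order trap described above (the real UDF is
# Java; every name here is invented for illustration).
class FakeReader:
    def lookup(self, ip):
        return {"country": "??", "ip": ip}


class GeocodeUdf:
    def __init__(self):
        self.reader = None  # the database-reader, normally set in configure()

    def configure(self, job_conf):
        # Called by the framework only when a real MapReduce job runs.
        self.reader = FakeReader()

    def evaluate(self, ip):
        # The original guarded this with an assertion - but Java
        # assertions are off by default (much like `python -O` here),
        # so in practice nothing fired and the crash came later.
        assert self.reader is not None, "configure() was never called"
        return self.reader.lookup(ip)


udf = GeocodeUdf()
# Small local Hive queries skip the job machinery entirely, so
# configure() never runs and evaluate() hits an uninitialized reader:
try:
    udf.evaluate("127.0.0.1")
except AssertionError as err:
    print("caught:", err)
```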
[12:07:07] * addshore is ready when you are
[12:07:20] \o/ :)
[12:07:39] addshore: https://meet.google.com/rxb-bjxn-nip
[12:11:27] addshore: for reference - the EL frontend sends data to kafka (1 single topic), and then EL workers validate against the schema and send back to kafka (1 topic per schema) :)
[12:23:44] (PS2) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698
[12:25:07] (PS3) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698
[12:27:42] (PS4) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698
[12:38:49] (PS5) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698
[12:47:49] Analytics, Wikidata: Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore)
[12:48:04] Analytics, Wikidata: Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore)
[12:48:17] (PS6) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698 (https://phabricator.wikimedia.org/T239471)
[12:48:29] Analytics, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore)
[12:48:44] Analytics, Wikidata, Patch-For-Review, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore) a: Addshore
[12:51:22] Analytics, Wikidata, Patch-For-Review, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore)
[12:52:12] I made some phab tickets to track this :)
[12:52:45] cool :)
[12:54:13] also, it did the small 2 billion row tables; now for the big one
[12:57:43] -- Creates table statement for raw mediawiki_wb_terms table.
[12:57:44] OR
[12:57:48] -- Creates table statement for raw wikibase_wb_terms table.
[12:58:37] addshore: the name of the DB - I suggest going for wikibase_wb_terms - Very explicit difference from mediawiki_x
[12:58:48] Thanks for suggesting!
[12:58:54] will do!
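A minimal sketch of the EventLogging flow joal describes at the top of this exchange: one raw client-side topic in, validate each event against its schema, and re-produce to a per-schema topic. The topic names, the inline schema "registry" and the event fields below are illustrative assumptions, not the actual EL worker code.

```python
# Illustrative consume-validate-produce loop for the EL pipeline
# described above; all names here are assumptions, not the real code.
import json

from jsonschema import ValidationError, validate
from kafka import KafkaConsumer, KafkaProducer

SCHEMAS = {  # stand-in for the real schema registry
    "Popups": {"type": "object", "required": ["event"]},
}

consumer = KafkaConsumer("eventlogging-client-side",
                         bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    event = json.loads(message.value)
    schema_name = event.get("schema", "")
    try:
        validate(event, SCHEMAS[schema_name])
    except (KeyError, ValidationError):
        continue  # invalid events are dropped (or routed to an error topic)
    # One output topic per schema, as described above.
    producer.send("eventlogging_" + schema_name,
                  json.dumps(event).encode("utf-8"))
```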
[13:14:28] (PS1) Joal: Fix GetGeoDataUDF and underlying function [analytics/refinery/source] - https://gerrit.wikimedia.org/r/553726 (https://phabricator.wikimedia.org/T238432)
[13:17:07] (PS7) Addshore: sqoop, add wikidata terms related tables [analytics/refinery] - https://gerrit.wikimedia.org/r/553698 (https://phabricator.wikimedia.org/T239471)
[13:17:14] (PS1) Addshore: hive tables for wikibase term secondary storage [analytics/refinery] - https://gerrit.wikimedia.org/r/553727 (https://phabricator.wikimedia.org/T239471)
[13:17:28] joal: I noticed 1 more error in the sqoop thing in 1 table, fixed in PS7
[13:17:37] one field was missing for one table
[13:17:51] okey :)
[13:18:02] Will relaunch once the rest is done
[13:20:38] Analytics, Analytics-Kanban, Patch-For-Review: Fix non MapReduce execution of GeoCode UDF - https://phabricator.wikimedia.org/T238432 (JAllemandou)
[13:21:04] Analytics, Analytics-Kanban, Patch-For-Review: Fix non MapReduce execution of GeoCode UDF - https://phabricator.wikimedia.org/T238432 (JAllemandou)
[13:22:07] ack
[13:22:39] joal: wb_terms (the big one) only took 27 mins :D
[13:22:47] I just saw that :)
[13:22:50] this is great
[13:24:10] addshore: will relaunch wbt_text_in_lang
[13:24:15] thanks!
[13:26:00] Started
[13:27:24] Analytics, Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (tizianopiccardi) >>! In T237605#5700591, @elukey wrote: >>>! In T237605#5700493, @Miriam wrote: >> Hi! Can I have access too please? My username is mirrys > > ` > elukey@krb1001:~$ sudo manage_...
[13:28:52] joal: all done!
[13:28:57] super quick
[13:37:04] Analytics, Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (elukey) >>! In T237605#5701535, @tizianopiccardi wrote: > > Hi Luca, I need it too! Thank you > > Username: piccardi > Email: tiziano.piccardi@epfl.ch ` elukey@krb1001:~$ sudo manage_principa...
[13:37:11] (CR) Joal: hive tables for wikibase term secondary storage (14 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/553727 (https://phabricator.wikimedia.org/T239471) (owner: Addshore)
[13:37:27] addshore: tables created, and comments for some fixes on the table creation script :)
[13:37:35] addshore: tables are created in the joal db
[13:38:18] * addshore looks
[13:39:38] fixed!
[13:39:39] (PS2) Addshore: hive tables for wikibase term secondary storage [analytics/refinery] - https://gerrit.wikimedia.org/r/553727 (https://phabricator.wikimedia.org/T239471)
[13:39:46] I meant to go back through and fix those commas....
[13:39:52] :)
[13:40:00] is the data already in the tables too?
[13:40:03] elukey: I need superpowa-man please :)
[13:40:07] addshore: it should be!
[13:40:18] wowza, that was much faster than I was expecting
[13:40:30] * addshore goes to notebook :D
[13:40:34] :)
[13:40:39] Have fun addshore :)
[13:41:18] Analytics, Wikidata, Patch-For-Review, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore) Should be all done, and tables are created in the joal db
[13:46:50] joal: what will the snapshot be?
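The notebook sanity check addshore runs next would look something like this minimal PySpark sketch. The db and table names come from the discussion above; the session options are assumptions.

```python
# Minimal PySpark sketch of a post-sqoop sanity check from a notebook:
# count what actually landed in the new table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wb_terms-sanity-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT COUNT(*) AS rows FROM joal.wikibase_wb_terms").show()
```

Once a snapshot partition is in place (the question just asked above), the same query would add a `WHERE snapshot = '...'` clause to pin a specific import.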
[13:47:07] 2.469 billion rows in joal.wikibase_wb_terms - nice
[13:47:17] addshore: '2019-10'
[14:01:50] :D
[14:02:11] Looks like we have a happy addshore on the chan :)
[14:02:15] :DDDD
[14:02:27] joal, so, I have 2 tables, each of which now just contains 1 col of bigints, and I essentially want to do a diff
[14:02:32] got any spark magic for that?
[14:02:50] addshore: left-outer join
[14:02:55] and filter for nulls
[14:04:05] nice --^
[14:04:33] aaah yes, I guess I just keep doing it in SQL....
[14:06:35] and actually I said left, but better to use full-outer, store in a df, and then filter for nulls on either side
[14:28:10] elukey: I'm getting something very weird :)
[14:29:24] joal: where? :D
[14:29:41] elukey: in hive, trying to fix the geocode issue
[14:30:31] elukey: I'd like to get, on an-worker1078.eqiad.wmnet, the content of /var/lib/hadoop/data/h/yarn/local/usercache/joal/appcache/application_1573208467349_112031/container_e02_1573208467349_112031_01_000343/hs_err_pid37420.log
[14:31:04] * addshore ran away and came back, got java.lang.OutOfMemoryError for the last thing I tried :D
[14:31:20] * addshore will try this data frame thingy and full outer
[14:31:35] elukey: this is what the hadoop application logs tell me for my failing hive query: https://gist.github.com/jobar/313922ed53d310b5c5a34f095bce7757
[14:33:04] lovely
[14:36:39] joal: so that is related to the new code?
[14:36:46] yessir
[14:36:58] elukey: I tried various things, and don't understand
[14:38:43] it seems that the jvm is freeing something wrong.. we could enable coredumps on all the workers, but the stacktrace would probably not give us insights about the java code that triggers it
[14:38:47] sneaky
[14:38:57] yeah, unhappy me :(
[14:52:44] Gone for kids
[14:54:17] o/
[14:55:06] addshore: just to be sure, do you guys have any automated job (cron/etc..) that pushes to / reads from Hadoop?
[14:55:19] oooh, maybe
[14:55:19] to check if it will break or not with kerberos
[14:55:35] goran definitely will, I think, but he is aware kerberos is coming
[14:55:40] let me quickly check our other scripts
[14:56:05] I don't think we have other ones
[14:56:49] super
[14:56:55] yes yes, I already had a chat with Goran :)
[15:01:29] * addshore has a 2709497-row result from spark that I need to export somehow and make public... How can I do that? :D
[15:01:39] (a one-off thing)
[15:02:00] Analytics, Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796 (Aklapper) @Shilad: Hi! Is this task still valid and should still be open (and its patches in Gerrit)? If yes, are you still working (or still plan to work) on this task? If you do not plan to work o...
[15:25:50] ooooh, think I'm getting there
[15:39:11] Analytics, Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796 (Shilad) Hi @Aklapper, Thanks for asking! I think this got stuck in code review. I'm happy to step in and move it forward once folks have time to code review it.
[15:39:34] Analytics-Kanban, Better Use Of Data, Event-Platform, Operations, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata)
[15:49:45] Analytics, Wikidata, Patch-For-Review, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (Addshore) @JAllemandou I'll move this to waiting on our board for now. I guess we should probably merge...
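joal's full-outer suggestion above, spelled out as a PySpark sketch: join the two single-column bigint tables and keep the rows where either side is null. The table and column names below are placeholders for whatever the two id extracts are actually called.

```python
# Sketch of the diff suggested above: full-outer join two one-column
# tables of ids, then filter for nulls on either side. All table and
# column names are placeholders, not the real extracts.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

old_ids = spark.sql("SELECT term_row_id AS old FROM my_db.ids_from_wb_terms")
new_ids = spark.sql("SELECT term_row_id AS new FROM my_db.ids_from_wbt_tables")

diff = (old_ids
        .join(new_ids, F.col("old") == F.col("new"), how="full_outer")
        .filter(F.col("old").isNull() | F.col("new").isNull()))

# new == NULL: the id exists only in the old table (a migration hole);
# old == NULL: the id exists only in the new tables.
diff.show()
```

Compared with the plain left-outer join first suggested, the full-outer variant surfaces rows missing from either side in one pass, which is why it was preferred for hole-checking here.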
[15:50:00] :D woooo
[16:51:41] I guess it has worked addshore :)
[16:53:38] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/552510 (https://phabricator.wikimedia.org/T238855) (owner: Joal)
[16:53:59] yup
[16:54:13] Great :)
[16:57:14] Analytics, Analytics-Kanban, Wikidata, Patch-For-Review, and 2 others: Sqoop wikidata terms tables into hadoop - https://phabricator.wikimedia.org/T239471 (JAllemandou)
[17:39:46] Analytics, User-Elukey: Redesign architecture of irc-recentchanges on top of Kafka - https://phabricator.wikimedia.org/T234234 (elukey)
[17:42:12] Analytics, User-Elukey: Redesign architecture of irc-recentchanges on top of Kafka - https://phabricator.wikimedia.org/T234234 (elukey) Today I dug a bit more into the current architecture of irc.wikimedia.org, and I have added all the info to the description of the task. It seems a bit crazy to me that...
[17:47:01] * elukey off!
[19:04:32] Analytics, Product-Analytics, Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (Nuria) ping @nettrom_WMF again, let us know if @mforns' proposed plan works cc @kzimmerman
[21:42:46] Analytics, Multimedia, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Nuria) @Tgr so as of today the mediaviewer sends pings to media/beacon and I'm not sure what happens from there. Now, there are no special headers for those requests to...
[21:50:15] Analytics, Multimedia, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Nuria) Ping to @fdans, as part of the media requests API we need to provide a measure of how many of those requests might be preloads for mediaviewer.
[21:53:11] Analytics, Multimedia, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Nuria) The way things are instrumented right now I do not think there are any headers on mediaviewer preload requests; as such those are indistinguishable from regul...
[23:22:33] (CR) Nuria: [C: +1] Override scap config for git_binary_manager in thin env [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/553702 (owner: Elukey)