[04:29:12] (CR) Krinkle: "Thanks!" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/409714 (https://phabricator.wikimedia.org/T187010) (owner: Nuria)
[08:48:48] hello people
[08:49:08] just checked eventlogging and all the processors stopped dying after the deployment \o/
[10:26:39] * elukey coffee! brb
[10:54:08] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3985053 (elukey)
[10:55:04] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3912027 (elukey) After the deployment I can't see any more duplicate errors in the m4-consumer log and no trace of p...
[12:00:01] * elukey lunch!
[12:10:17] (PS10) Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[15:03:02] first version of https://grafana-admin.wikimedia.org/dashboard/db/kafka-consumer-lag ready!
[15:03:16] (without the -admin)
[15:04:11] ours would be https://grafana-admin.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=eqiad
[15:44:24] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3986038 (GoranSMilovanovic) - This will be solved from the [[ https://meta.wikimedia.org/wiki/Schema:Edit...
[15:52:08] Analytics, Discovery, Wikidata, Wikidata-Query-Service, Wikimedia-Stream: Increase kafka event retention to 14 or 21 days - https://phabricator.wikimedia.org/T187296#3986057 (Ottomata) Is there a reason we want to do this on main instead of jumbo? Stas will be consuming from jumbo, since it...
[16:01:14] fdans: GET IN THE STANDUP YOU BETTER WATCH OUT I'M IN CHARGE NOW
[16:01:28] oh no
[16:01:56] (PS11) Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[16:02:30] ottomata: you are frozen
[16:04:59] (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[16:12:34] ottomata: Mwarf :( Any idea on how to make jenkins give more memory to that docker? https://integration.wikimedia.org/ci/job/analytics-refinery-maven-java8-docker/5/console
[16:14:48] (PS1) Mforns: [WIP] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:32:25] joal: no, ask hashar?
[16:49:22] Analytics-Kanban, ChangeProp, EventBus, Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3758088 (elukey) So summary: - we have https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag up and running, but main-codfw is not showi...
[16:54:13] ottomata: here is the code to make presto work in slider:
[16:54:20] ottomata: https://github.com/prestodb/presto-yarn
[16:56:39] Analytics, Analytics-Wikistats: data missmatch for number of editors - https://phabricator.wikimedia.org/T187806#3986460 (Lydia_Pintscher)
[17:01:09] Analytics, Analytics-Wikistats: data missmatch for number of editors - https://phabricator.wikimedia.org/T187806#3986460 (JAllemandou) @Lydia: You should split by editor type. The editors you are talking about are, I think, what we call in Wikistats 2 `registered-users editors`. Please let us know if I'm wrong!
[17:29:18] Analytics, Analytics-Wikistats: Data mismatch for number of Wikidata editors - https://phabricator.wikimedia.org/T187806#3986662 (Nemo_bis)
[17:38:09] Analytics, Analytics-Wikistats: Data mismatch for number of Wikidata editors - https://phabricator.wikimedia.org/T187806#3986460 (Nemo_bis) Note that the only official number for "active editors" is the 5+ edits/month editors, which is not (easily) provided by WikiStats v2. To get a (theoretically) comp...
[17:40:27] (PS12) Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[17:45:20] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#3986748 (elukey)
[17:47:09] ottomata: thoughts about where/how to put the stat1002-a dir in hdfs?
[17:47:20] (from stat1005)
[17:48:09] /wmf/data/archive/backup ?
[17:49:48] could be it, yes
[17:50:53] ahh and using something like copyFromLocal I can avoid compressing the dir, right?
[17:51:02] (we have only 300GB left)
[17:56:58] Analytics, InteractionTimeline, Anti-Harassment (AHT Sprint 15): Measure how many unique people visit the Timeline - https://phabricator.wikimedia.org/T187374#3986831 (dbarratt) @Nuria Could you create a new project/account for the #interactiontimeline ? If you send me the JS code I should be able to...
[18:15:44] ottomata: something like sudo -u hdfs -copyFromLocal /srv/stat1002-a /wmf/data/archive/backup/stat_hosts ?
[18:16:21] hdfs might not be able to read stat1002-a's files though
[18:21:07] ?
[18:21:30] elukey maybe /wmf/data/archive/backup/misc/stat1002-a (and then a README in stat1002-a)
[18:21:33] ?
[18:23:30] ok I am not going to argue names, but is the command correct? Never done it
[18:25:22] looks ok i think!
[18:25:24] can't hurt to try
[18:26:12] sure
[18:27:48] started, it might take a whie
[18:27:49] *while
[18:31:23] got some perms denied
[18:31:49] ohh interesting, right, probably because the hdfs user can't read everything in /srv
[18:31:51] ohHH that's what you mean
[18:31:53] sorry, didn't understand
[18:31:54] yeah
[18:32:10] well nothing super big, I'll amend whatever gets perms denied
[18:32:11] yeah, elukey you could put it in /user/root/ in hdfs first
[18:32:12] with sudo
[18:32:20] and then hdfs dfs -mv into backup/ after
[18:32:47] /wmf/data/archive/backup/misc/stat1002-a/discovery/golden-dev/test_2017-03-24_17:39:10.log.md is not a valid DFS filename
[18:32:50] ah lol
[18:33:55] all right, will check tomorrow morning.. ottomata if you have time would you mind taking a look? So we'll remove /srv/stat1002-a asap
[18:34:13] task is https://phabricator.wikimedia.org/T186776
[18:34:20] otherwise I'll try to work on it tomorrow
[18:35:01] ok
[18:35:08] take a look to see if it is done?
[18:35:10] you mean elukey?
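For context on the copy discussed above: the plain `-copyFromLocal` run failed both because the hdfs user cannot read every file under /srv and because HDFS rejects filenames containing `:` (the `test_2017-03-24_17:39:10.log.md` error). A minimal Scala sketch of a more forgiving copy using Hadoop's FileSystem client API follows; the source and destination paths match the chat, but the object name, the `:` to `_` substitution, and the skip-on-error behaviour are illustrative assumptions, not what was actually run.

```scala
import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.JavaConverters._

// Sketch: copy a local directory into HDFS file by file, skipping unreadable
// files and sanitizing names that HDFS will not accept (':' is not allowed).
object BackupToHdfs {
  def main(args: Array[String]): Unit = {
    val localRoot = Paths.get("/srv/stat1002-a")                 // source on stat1005
    val hdfsRoot  = "/wmf/data/archive/backup/misc/stat1002-a"   // destination discussed above

    val fs = FileSystem.get(new Configuration())

    Files.walk(localRoot).iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .foreach { src =>
        // Replace ':' so names like test_2017-03-24_17:39:10.log.md become valid DFS filenames.
        val rel  = localRoot.relativize(src).toString.replace(":", "_")
        val dest = new Path(s"$hdfsRoot/$rel")
        fs.mkdirs(dest.getParent)
        try fs.copyFromLocalFile(false, true, new Path(src.toString), dest)
        catch { case e: Exception => println(s"skipped $src: ${e.getMessage}") } // e.g. permission denied
      }
  }
}
```

Run as a user that can both read /srv and write to HDFS, this avoids the all-or-nothing behaviour that stopped the single copyFromLocal command.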
[18:35:21] nuria_ joal ottomata: hopefully I didn't cause any problems with https://phabricator.wikimedia.org/T184095#3986912 if I did I am so sorry
[18:36:56] ottomata: nono my copyFromLocal failed due to the weird name, so it should be restarted
[18:37:15] maybe excluding/renaming those weird filenames
[18:37:57] anyhow, going afk :)
[18:38:01] * elukey off!
[19:00:12] yo joal: a 3rd person asked me about clickstream for Wikidata, I am not sure how easy it is to generate something similar to the WP datasets or if you guys have thought of this
[19:08:21] DarTar: I have ideas around what this means, but this is very much not the "clickstream for wikidata" I have in mind - I'm thinking more of: clickstream cross-language (tagged by language) between wikidata entities
[19:08:32] bearloga: I'll have a look at your ticket
[19:08:42] bearloga: I'm pretty sure you didn't break anything
[19:08:48] bearloga: When creatin
[19:08:51] again sorry
[19:09:30] bearloga: When creating a table in hive, file format is an important improvement - I'm assuming you'll be interested in analytics-oriented queries over your data, therefore you should use parquet
[19:09:57] bearloga: We have a ticket to make parquet the default file format for hive table creation, but our hive version doesn't allow for it as of now
[19:13:45] bearloga: something else I'm thinking of: you extract distinct - wouldn't it be interesting to keep the number of actions?
[19:16:27] joal: thanks! I should modify the query and rerun. although I'm having a hard time tracking down the correct way to specify the parquet file format (in the CREATE TABLE statement vs SET???); and yeah! I should probably include a count of requests per hour
[19:16:46] bearloga: commenting on the ticket, give me a minute :)
[19:21:14] joal: thank you very much!!!
[19:21:53] bearloga: I'll add other comments in the ticket as well for the archives :) But prefer to write them to you in here first :)
[19:22:35] bearloga: I'm looking at the data in the table, and 1h of data is ~10Mb
[19:22:50] bearloga: There is no point in storing that by the hour
[19:23:02] bearloga: Let's partition by day :)
[19:27:32] joal: weird question and I suspect the answer is no but do you think it's possible to…dynamically?…partition the data? like, the query extracts a lyear/lmonth/lday which are localized versions via timezone. could we store data into partitions based on those?
[19:27:45] bearloga: feasible
[19:28:18] bearloga: but not super easy, because currently-existing partitions overlap newly created ones
[19:28:35] bearloga: Doing so would mean extracting most of the data at once
[19:28:40] bearloga: Feasible though
[19:29:53] bearloga: I'm currently waiting for my test of the modified code to finish, and will send the comment
[19:31:46] joal: I see. if I'm thinking about it correctly, each query run (if operating on an hour of data) would be sending data to & resizing #{timezones}+1 partitions (the +1 is for "Unknown") which would be hard to deal with
[19:32:21] bearloga: That's the idea
[19:33:06] bearloga: a small amount of data would be sent to TS+X / TS-X, making data not so efficient to work with
[19:34:35] (PS6) Ottomata: Add dataframe conversion to new schema function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241
[19:34:44] (CR) Ottomata: Add dataframe conversion to new schema function (1 comment) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[19:34:56] bearloga: However if we work a full month (or 2) at once, we get correct partitions
[19:36:25] joal: month +/- 12 hours on each end :P
[19:36:38] joal re https://gerrit.wikimedia.org/r/#/c/411090/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/jsonrefine/SparkSQLHiveExtensions.scala
[19:36:43] i tried to reuse the convert function
[19:36:44] bearloga: indeed :)
[19:36:54] couldn't, because of the custom logic needed in the recursion
[19:37:46] it does seem like it should be possible though...
[19:38:37] (PS4) Ottomata: Clean up the normalize function and add a new makeNullable function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090
[19:44:50] ottomata: https://gist.github.com/jobar/b121b0b0f5aa9fd3a7299c38b676f405
[19:45:56] wow that is simpler than i thought it would be
[19:45:58] magic man
[19:46:02] :D
[19:46:20] ottomata: Adding depth ;)
[19:46:43] into it
[19:47:15] (CR) jerkins-bot: [V: -1] Add dataframe conversion to new schema function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[19:47:50] bearloga: Comment sent !
[19:48:04] bearloga: please feel free to let me know if this stuff is not as expected ;)
[19:48:19] (CR) Ottomata: Refactor JsonRefine to use DataFrame converter (2 comments) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942 (owner: Ottomata)
[19:49:21] joal: this is awesome!! thank you very much!
[19:49:30] bearloga: you're very welcome :)
[19:55:11] joal, trying to figure it out, but struct.convert((field, _) => field.makeNullable(nullable)) isn't quite working
[19:55:28] ottomata: heh
[19:55:41] type mismatch
[19:56:02] OH
[19:56:03] i know
[19:56:03] sorry
[19:56:14] ?
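On the Parquet question above (CREATE TABLE vs SET): the file format belongs in the table definition, not in a session setting, and joal's partition-by-day advice fits the same DDL. A hedged sketch via Spark SQL in Scala, assuming Spark 2's SparkSession with Hive support; the database, table, and column names are hypothetical, and the actual guidance is in the Phabricator comment joal sent, which is not reproduced in this log.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// STORED AS PARQUET goes in the CREATE TABLE statement itself (no SET is
// needed for the format); partitioning by day rather than hour keeps the
// partition count sane for ~10 MB of data per hour.
spark.sql("""
  CREATE TABLE IF NOT EXISTS bearloga.actions_daily (  -- hypothetical names
    action        STRING,
    action_count  BIGINT
  )
  PARTITIONED BY (year INT, month INT, day INT)
  STORED AS PARQUET
""")

// Dynamic partitioning lets one INSERT fill every day partition it touches,
// which is also what the timezone-partitioning idea above would rely on.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE bearloga.actions_daily PARTITION (year, month, day)
  SELECT action, COUNT(*) AS action_count, year, month, day
  FROM bearloga.actions_staging                         -- hypothetical source table
  GROUP BY action, year, month, day
""")
```

The two SQL statements themselves are plain HiveQL and should work the same from the Hive CLI; only the SparkSession wrapper is Spark-specific.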
[19:56:28] paste error, didn't change fn type in convert arg
[19:56:34] Ahhh
[19:56:36] ok :)
[19:57:40] i thought it was scala weirdness with multi params or something, never can figure that syntax out, so assumed it was wrong
[19:58:43] (PS4) Ottomata: Refactor JsonRefine to use DataFrame converter [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942
[20:00:37] wow too many patches joal, i added that on the latest one
[20:00:39] oh well
[20:01:25] joal, if you have a sec to look over real quick, maybe we can merge into this branch, and then I can work from there on spark 2-ifying
[20:03:32] ottomata: The one above looks good at first sight
[20:04:06] well, the 3 i mean
[20:04:18] let's merge all three of these patches, it's getting unwieldy reviewing 3 at once
[20:04:26] ottomata: agreed
[20:04:42] ottomata: I mean, there's been enough work put into that for it to be good enough I think :)
[20:05:08] There is one last nit I want to check after your merge, and it'll be as good as I can think of :)
[20:05:16] aye, especially for just a remote branch merge
[20:05:19] ok
[20:15:48] (CR) Ottomata: "Whatever is causing jenkins to barf is fixed in the next patch." [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[20:21:32] joal: one last nit? :)
[20:22:26] ottomata: Using an accumulator instead of count to return your number - Let's wait for spark2 and the new API
[20:23:11] ok
[20:23:18] so gimme some sweet +1s :)
[20:24:04] (CR) Joal: [C: 1] "Let's finish this thing :)" [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[20:25:18] (CR) Joal: [C: 1] "Finishing - Step 2" [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942 (owner: Ottomata)
[20:25:59] (CR) Joal: [C: 1] "Last one I think" [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090 (owner: Ottomata)
[20:26:21] ottomata: I think I have not seen the last thing with the depth
[20:33:55] ?
[20:34:04] joal: last thing with depth?
[20:34:07] oh
[20:34:38] ...
[20:34:55] where did it go...
[20:35:49] err i don't know
[20:35:51] ok...
[20:35:58] ottomata: sorry :(
[20:36:19] will repatch
[20:36:26] dunno what happened
[20:38:31] (PS5) Ottomata: Clean up the normalize function and add a new makeNullable function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090
[20:38:34] ok added it joal ^
[20:39:01] (CR) Ottomata: [V: 2 C: 2] Add dataframe conversion to new schema function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[20:39:09] Thanks ottomata :)
[20:39:50] (CR) Ottomata: [V: 2 C: 2] Clean up the normalize function and add a new makeNullable function (1 comment) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090 (owner: Ottomata)
[20:39:54] (PS6) Ottomata: Clean up the normalize function and add a new makeNullable function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090
[20:39:56] (CR) Ottomata: [V: 2 C: 2] Clean up the normalize function and add a new makeNullable function [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090 (owner: Ottomata)
[20:40:08] ottomata: I'll take care of getting that into my spark-2 patch and test
[20:40:12] tomorrow
[20:40:46] oh!
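The convert / makeNullable helpers being merged above live in SparkSQLHiveExtensions.scala and in joal's gist, neither of which is reproduced in this log. A rough sketch of the shape being discussed (recursively rewriting a Spark StructType so that every field, nested ones included, becomes nullable); the names follow the chat, but the bodies below are illustrative, not the refinery-source implementation.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructField, StructType}

// Illustrative only: the real code is in the jsonrefine patches merged above.
object SchemaSketch {

  // Apply fn to every field, recursing into nested structs and arrays of
  // structs ("adding depth", as joal put it).
  def convert(struct: StructType, fn: (StructField, DataType) => StructField): StructType =
    StructType(struct.fields.map { field =>
      val recursed = field.dataType match {
        case s: StructType =>
          field.copy(dataType = convert(s, fn))
        case ArrayType(s: StructType, containsNull) =>
          field.copy(dataType = ArrayType(convert(s, fn), containsNull))
        case _ => field
      }
      fn(recursed, recursed.dataType)
    })

  def makeNullable(field: StructField, nullable: Boolean = true): StructField =
    field.copy(nullable = nullable)

  // Roughly the call from the chat, struct.convert((field, _) => field.makeNullable(nullable)),
  // written here without the implicit-class sugar.
  def makeAllNullable(struct: StructType): StructType =
    convert(struct, (field, _) => makeNullable(field))
}
```

The "type mismatch" in the exchange above is consistent with the function argument not matching convert's expected (StructField, DataType) => StructField signature, which is what the "didn't change fn type in convert arg" fix refers to.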
[20:40:47] ok
[20:44:03] ottomata: still one patch to merge, right?
[20:44:17] yeah, conflicts...
[20:44:20] Mwarf
[20:44:30] Ok - gone for tonight - Will test that tomorrow
[20:45:14] byeeee
[20:48:38] laters!
[20:53:28] (PS5) Ottomata: Refactor JsonRefine to use DataFrame converter [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942
[20:54:16] (CR) Ottomata: [V: 2 C: 2] Refactor JsonRefine to use DataFrame converter [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942 (owner: Ottomata)
[20:59:27] Analytics, Discovery, Wikidata, Wikidata-Query-Service, Wikimedia-Stream: Increase kafka event retention to 14 or 21 days - https://phabricator.wikimedia.org/T187296#3987360 (Smalyshev) I need it only on jumbo I think, that's where I'll be connecting.
[22:34:30] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3987592 (Ottomata) Yargh, @elukey. kafkatee. Gonna be weird on oxygen, since we can't make kafkatee consume from multiple kafka clus...
[23:02:07] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3987638 (GoranSMilovanovic) - @addshore @Lea_WMDE @Tobi_WMDE_SW This will take some time. - **Why this...
[23:18:26] ottomata[m]: I'm cleaning up my home dir on stat1005 and for some reason I don't have permission to do anything with /home/bearloga/anaconda3, would you be able to delete that for me? please and thank you!