[00:20:42] 10Analytics, 10Analytics-Wikistats: Change '--' to something more helpful in Wikistats page views by country table view - https://phabricator.wikimedia.org/T187427#4048471 (10sahil505) @mforns do you currently have something in mind instead of "--"?
[00:41:35] 10Analytics, 10Analytics-Wikistats, 10Accessibility, 10Easy: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4048483 (10sahil505) @mforns which search placeholder and input are mentioned here, the one in TopicExplorer or the one in WikiSelector?
[06:23:29] madhuvishy: o/
[06:23:39] elukey: \o
[06:24:03] how are youuuu
[06:24:05] :)
[06:24:26] if you have time we can chat about those ports, I'd need to ask you some info since I am super ignorant about nfs
[06:24:48] I'm good! how are you :)
[06:24:49] sure
[06:24:56] gooood :)
[06:25:12] so the docs that we have are in https://wikitech.wikimedia.org/wiki/Network_cheat_sheet#Edit_ACLs_for_Network_ports
[06:26:26] cool that's easy to parse! ;)
[06:26:58] elukey: so if you have access to the ACLs I'd pull whatever is set up for dataset1001 now
[06:27:14] we can then look it over and make sure it makes sense to apply for the labstores too
[06:27:54] madhuvishy: so those rules are basically filtering traffic from every host in the analytics vlan on our routers
[06:28:08] right
[06:28:16] what port do you need to be open?
[06:28:23] so I can check what's there now
[06:30:10] I think 2049 tcp, and there are probably a few more
[06:30:13] I can pull up
[06:30:17] that would be great
[06:32:51] elukey: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/wmcs/nfs/ferm.pp
[06:34:03] these are the ports in labstore1006&7 that stat hosts should be able to talk to
[06:34:25] and now the nfs server is on dataset?
[06:34:37] (I am trying to check if we have something in place now)
[06:34:42] yeah - 208.80.154.11
[06:34:57] (that's dataset1001)
[06:37:49] ah I found a rule that I wasn't aware of :D
[06:38:25] madhuvishy: all right I think I have all that I need, I'll ask Alex what's best and then make the change if needed
[06:38:30] how soon do you need this?
[06:38:33] yesterday?
[06:38:35] :D
[06:38:47] ha ha this week would be great so i can make sure it's all tested and ready
[06:39:30] super
[06:39:39] yay thanks so much!
[06:39:39] I'll work on it today
[06:40:00] * madhuvishy sends elukey some espresso
[06:41:27] * elukey receives the SIGCOFFEE and proceeds to brew one
[06:42:02] :D
[06:50:13] updated the task, going to wait for Alex's answer madhuvishy
[06:50:38] elukey: cool, I think I found something else too, https://phabricator.wikimedia.org/T117428
[06:51:00] We also need to open the rsync ports to pick up data from the stat boxes I think
[06:51:04] I'll amend the task
[06:51:16] ah yes there is a specific term for rsyncs too
[06:51:47] oh?
[06:53:53] yes we have it for example for eventlog1002, etc.
[06:54:11] so we can easily add a new IP in there
[06:54:39] anyhow, I think that we can manage to fix this in one, maximum two days tops
[06:54:45] so you'll be unblocked
[06:56:01] cool, thank you - I amended the description for the rsyncs
[06:56:58] gotta go sleep now, but feel free to ping on the ticket if anything comes up and I'll respond in the morning! thanks again elukey :)
[06:57:33] gnight!!
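As an aside to the ports discussion above: a minimal sketch of how one could check, from a stat box, that the NFS port is reachable once the router ACL change is in place. The host 208.80.154.11 (dataset1001) and port 2049/tcp are taken from the conversation; any additional ports from ferm.pp would be checked the same way.

    import socket

    def port_is_open(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # dataset1001 (the current NFS server) and the standard NFS port, per the discussion above.
    print(port_is_open("208.80.154.11", 2049))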
[07:02:57] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4048798 (10srishakatux) This message is for students interested in working on this project for #gsoc2018 * Student application deadline is **March 27 16:00 UTC**. *...
[08:35:00] geocode data on webrequest is so awesome
[08:35:18] you guys are great :)
[08:36:01] (you == all the ones spending time and effort making this work)
[09:54:10] so https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/Metrics.html seems nice
[09:59:07] 10Analytics, 10Analytics-Data-Quality, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Please review: public data sets for the WDCM Biases Dashboard - https://phabricator.wikimedia.org/T189653#4049110 (10Aklapper)
[09:59:50] added some for the namenodes
[11:12:48] 10Analytics, 10Operations, 10Ops-Access-Requests, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4049311 (10Vgutierrez) @DarTar this also affects users @Daniela.paolotti and ciro, right?
[11:31:01] 10Analytics: Functionality to share & view SWAP notebooks - https://phabricator.wikimedia.org/T156934#2990369 (10Neil_P._Quinn_WMF) My current workaround for this is to track my projects in Git (which is a good practice generally) and then push them to GitHub, which has pretty good support for viewing Jupyter no...
[11:32:46] errand + lunch, will be back in ~2h (ping me if needed of course)
[12:37:27] Hey dsaez - is now a good moment?
[12:37:36] hi joal, sure
[12:37:44] great :)
[12:37:49] dsaez: batcave or irc?
[12:38:20] irc is good :)
[12:38:29] ok
[12:39:22] so... partitionByKey
[12:39:48] So about partitioning - The reason I'm asking for clarifications is because this term represents different things depending on the tech layer (shuffle, files) and the stack (spark, hive)
[12:40:12] yes.
[12:40:49] partitionByKey in spark is about shuffling data into buckets
[12:41:27] And as explained in the article you mentioned, if using the same partitioner (partition on the same key, with the same number of partitions), then there can be perf improvements
[12:41:34] on joins
[12:42:11] yes, I've tried that, with my own partitioner, and it works really well.
[12:42:20] oh yes :)
[12:43:08] Secondary-sorting actually uses the same mechanism to make use of partitioning and sorting efficiently on the platform
[12:43:24] However, when speaking about datasets as files, partitioning
[12:43:45] is about splitting the dataset into sub-portions that can be queried independently
[12:44:07] ok, i see
[12:44:30] So, with an example: we use partitioning by wiki when importing
[12:44:34] the raw data
[12:45:12] data is stored in folders like: wiki_db=enwiki/datafiles, wiki_db=wikidatawiki/datafiles etc
[12:45:29] Like that, when querying for a single wiki, well you only read the specific folder
[12:45:36] Super useful for reducing IOs
[12:46:07] We partition the webrequest table by webrequest_source and date (year, month, day, hour), for instance
[12:46:36] got it, this is what we use in Hive queries, for example.
[12:46:44] Very correct
[12:46:59] It allows less data to be read for a given query
[12:48:08] partitionByKey as we discussed before is different - In hive it actually has a different name (this is good, since it's not the same thing) -- CLUSTERED BY
[12:48:50] ok.
[12:49:02] I think i understood the difference.
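To make the distinction joal draws above concrete, a toy PySpark sketch (paths, sizes and data are made up) of the two meanings of "partitioning": shuffle-level co-partitioning of keyed RDDs for cheaper joins, versus Hive-style directory layout written with partitionBy.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
    sc = spark.sparkContext

    # 1) Shuffle partitioning: give both keyed RDDs the same partitioner and
    #    partition count, so the subsequent join does not re-shuffle either side.
    num_partitions = 8  # toy value; enwiki-sized data would need far more
    revs = sc.parallelize([(i, "wikitext %d" % i) for i in range(100)]).partitionBy(num_partitions)
    meta = sc.parallelize([(i, i % 3) for i in range(100)]).partitionBy(num_partitions)
    print(revs.join(meta).count())

    # 2) File/layout partitioning (what Hive and the raw-data import use):
    #    rows land in wiki_db=enwiki/..., wiki_db=wikidatawiki/... folders, so a
    #    query filtering on wiki_db only reads the matching directory.
    df = spark.createDataFrame([("enwiki", 1), ("wikidatawiki", 2)], ["wiki_db", "rev_id"])
    df.write.mode("overwrite").partitionBy("wiki_db").parquet("/tmp/partitioning_demo")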
[12:49:03] When you have a hive table clustered by a given key (with a certain number of buckets), you can easily sample
[12:49:10] Cool
[12:50:03] so, what do you think would be the right strategy to do joins between the parquet dumps and hive tables? or is that impossible if we use the full English dump?
[12:50:15] In the end, the reason I don't think partitioning by key in advance will help is because partitioners may differ, number of partitions may differ, and when reading data, spark actually doesn't know that this data has been written partitioned by key
[12:51:08] You want to join parquet_dumps with mediawiki_history for instance
[12:51:14] exactly
[12:51:49] hm - Parquet_dumps are partitioned by wiki, so doing single wikis one at a time is a first step
[12:52:02] sure.
[12:52:38] Then, I don't have a better idea than joining by revision_id
[12:53:06] if i use workers with 8GB, I get OOM errors very quickly
[12:53:20] I've tried directly in scala, same problem.
[12:54:06] If as you suggest, we pre-partition the dumps data, there is a good possible improvement, but it involves lower level coding, since a simple join would not do
[12:54:21] dsaez: This sounds bizarre
[12:54:29] dsaez: how much memory for the driver?
[12:54:46] 8GB too, I've also tried with 16GB
[12:54:58] dsaez: Also, given the huge join, a good amount of memory-overhead for workers would help
[12:55:25] but even when I've tried with 42GB, i didn't get OOM, but it was incredibly slow, I killed the process after 2 hours I think.
[12:55:56] if I delete the column with the text, and just keep revision_id, it's fast, even more so when doing the partitionByKey
[12:56:32] The thing I also dislike with pre-partitioning data is that it's of no use for people not knowing about it (spark doesn't do optimisation based on that by itself), and it also 'turns' the dataset into a specific use-case
[12:56:36] hm
[12:56:53] dsaez: transforming the XML dumps to parquet takes about 2 days
[12:56:57] but just keeping the result of my parsing (in this case [[wikilinks]]), I get some problems.
[12:57:05] joal: I see
[12:57:18] So i wouldn't be surprised if it took a big amount of time for joining (knowing it shuffles the data)
[12:57:52] hm
[12:58:29] one idea that I've tried was to create multiple DataFrames, separated by revision_id ranges.
[12:58:31] Ok - I'm gonna enforce partitioning by rev_id for the dumps when transforming them
[12:58:52] dsaez: I had the same idea for making queries smaller
[12:59:15] dsaez: unfortunately it makes the data less easily usable
[12:59:24] joal: that's true.
[13:00:00] dsaez: Do you have some code as an example for me to look at (links extraction + join)?
[13:00:39] but is there any way to make this process faster, after reading the full dataframe? like an index in MariaDB?
[13:00:52] joal: yes, let me upload it to github
[13:01:10] let me clean it up, I was playing in the spark-shell
[13:01:28] dsaez: Not sure what you mean by "this process faster" - links extraction?
[13:01:35] sure
[13:01:54] no, dividing the dataframe by rev_id
[13:02:38] Something else to notice dsaez - In the last patch, I added all of the available dump data with the text in the parquet dump
[13:02:59] joal: that is also good.
[13:03:18] joal: in fact, that solves all my problems.
[13:03:23] :)
[13:04:24] you'll have timestamp, page_title, page_redirect_title, user_id, user_text, sha1, etc. (see https://gerrit.wikimedia.org/r/#/c/361440/6/src/main/scala/org/wikimedia/wikihadoop/job/XmlConverter.scala)
[13:05:31] great. Is that script running now?
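A hedged sketch of the kind of tuning discussed above for the rev_id join (the values, the mediawiki_history path and the column names are guesses, not the actual job): more shuffle partitions than the spark.sql default of 200, some executor memory overhead, and dropping the heavy text column before the shuffle.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("wikitext-history-join")
             .config("spark.executor.memory", "8g")
             .config("spark.yarn.executor.memoryOverhead", "2048")  # extra off-heap headroom for the big shuffle
             .config("spark.sql.shuffle.partitions", "4096")        # the default of 200 is far too small for enwiki
             .getOrCreate())

    wikitext = spark.read.parquet("hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/enwiki")
    history = spark.read.parquet("hdfs:///path/to/mediawiki_history")  # hypothetical path

    # Keep only what the join needs; the text column stays out of the shuffle.
    links = wikitext.select("rev_id", "page_id")  # column names are assumptions
    joined = links.join(history, links.rev_id == history.revision_id)
    joined.write.parquet("/tmp/links_with_history")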
[13:05:36] dsaez: Joining with mediawiki_history will still be very useful for some precomputed fields (revert info, count-by-user or -page info) as I think of them
[13:06:16] dsaez: no - I was planning to wait for our next dump gathering
[13:06:48] joal: yes! although the new dump will solve my current problems, finding a good way to do these joins would be helpful in the future, for sure.
[13:07:45] dsaez: If you agree, I'll try to join by rev_id on enwiki from a dataset of extracted links and mediawiki_history
[13:08:07] joal: sounds good to me. Thanks
[13:08:17] dsaez: experimenting with that will also help me have a better understanding of the problem :)
[13:08:40] dsaez: If you have scala code that extracts links the way you like, can you give it to me please?
[13:09:37] joal: no, I don't, what I can do is save the dataframe that I parse with pyspark, and then give it to you for joining
[13:09:54] dsaez: I'd do that in any case :)
[13:10:27] joal: it is basically one regular expression, let me see if I can code it in scala. I imagine that won't be very different.
[13:10:53] dsaez: this also gives me a hint - Since you use a dataframe and a UDF, data is partitioned with the default spark-SQL setting which is 200
[13:10:59] This is too small for enwiki
[13:11:12] dsaez: it wouldn't, I'm sure
[13:11:38] I've changed that somewhere, I think. let me see
[13:17:39] dsaez: you sent me re.findall("\[\[(.*?)\]\]",wikitext) the other day
[13:17:45] dsaez: is that good?
[13:18:07] joal: https://github.com/digitalTranshumant/Wiki-examples/blob/master/CreateLinkGraphPython2.ipynb
[13:18:37] dsaez: Ok looks like what I have is good :)
[13:18:39] joal: wait, that's the old version, let me update
[13:18:47] Arf, ok waiting :)
[13:20:09] joal: now
[13:20:09] https://github.com/digitalTranshumant/Wiki-examples/blob/master/CreateLinkGraphPython2.ipynb
[13:20:12] cleaner
[13:20:36] ok great :)
[13:22:33] df = spark.read.parquet('hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/enwiki')
[13:22:34] df.rdd.getNumPartitions()
[13:22:34] 65558
[13:22:52] yessir
[13:23:19] Now, when you work that as an RDD and apply a UDF on top of it, I think the partition number reduces to 200
[13:53:12] joal: nop
[13:53:15] df2.rdd.getNumPartitions()
[13:53:15] Out[11]: 65558
[13:53:42] Great - map-only function
[13:58:59] dsaez: I'm assuming you're only interested in revisions having links, correct?
[14:02:37] joal: hmmm, if that makes a difference it's ok, but in principle i would like to know when there are no links too
[14:02:49] added some RPC-related graphs for RM and NM to https://grafana.wikimedia.org/dashboard/db/analytics-hadoop
[14:02:55] :)
[14:03:12] joal: but i can run a different task
[14:03:20] dsaez: nah, should be ok
[14:03:21] cool!
[14:03:22] task for that
[14:03:34] elukey: i think we should not block on coal for el -> jumbo
[14:03:45] i think we should keep the zmq up and running from jumbo when we migrate for a bit
[14:03:49] so they can fall back if they need to
[14:04:08] morning :)
[14:04:19] hiii :D
[14:04:38] wdyt about having a second zmq-forwarder on eventlog1001?
[14:04:50] why not just switch the current one when we switch the rest of eventlogging?
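Coming back to the link-extraction part of the conversation above: a rough PySpark sketch of the findall pattern wrapped in a UDF. The column names are assumptions; the notebook linked above is the actual reference, and a map-only select like this keeps the input partitioning (the 65558 partitions shown), rather than falling back to the 200 shuffle partitions.

    import re
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.appName("link-extraction-sketch").getOrCreate()

    LINK_RE = re.compile(r"\[\[(.*?)\]\]")

    def extract_links(wikitext):
        if wikitext is None:
            return []
        # Keep the link target before any "|" display text, as in [[Target|label]].
        return [m.split("|")[0] for m in LINK_RE.findall(wikitext)]

    extract_links_udf = F.udf(extract_links, T.ArrayType(T.StringType()))

    df = spark.read.parquet("hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/enwiki")
    links = df.select("rev_id", extract_links_udf(F.col("text")).alias("links"))  # "rev_id"/"text" are guesses
    print(links.rdd.getNumPartitions())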
[14:04:55] then their system will work exactly the same
[14:04:58] nothing will change for them
[14:05:12] it makes more sense yes :)
[14:05:24] elukey: another reason it might be good to switch first: https://phabricator.wikimedia.org/T183297#4047029
[14:05:29] so they don't have to deal with offsets changing on them
[14:05:40] if we get them started from jumbo first thing
[14:06:07] sure
[14:06:44] can we do it now :D
[14:06:45] irc is fine :)
[14:06:45] do we need to leave the zmq-forwarder running for a bit after the migration to jumbo to consume all the topic partitions?
[14:07:01] even hangouts, today there's less noise
[14:07:06] k
[14:07:08] lemme grab a coffee and in 5m we can start
[14:07:10] ok?
[14:07:13] yea, like all the other els
[14:07:13] ya
[14:07:16] •  Let EventLogging and webperf consume and process all remaining events in eventlogging-client-side topic in analytics Kafka.
[14:07:31] makes sense
[14:12:07] ottomata: indacave wheneveryouwant
[14:30:19] (03PS5) 10Fdans: Responsive Wikistats 2 UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999
[14:32:08] (03CR) 10jerkins-bot: [V: 04-1] Responsive Wikistats 2 UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999 (owner: 10Fdans)
[14:44:57] !log beginning migration of eventlogging analytics from Kafka analytics to Kafka jumbo: T183297
[14:45:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:45:05] T183297: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297
[14:59:11] a-team - Time-change conflict again today - Will be there in about an hour
[15:07:46] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4050053 (10Ottomata)
[15:10:44] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4050060 (10Ottomata) @Imarlier disregard my earlier comment about auto.offset.reset. Since we...
[15:13:58] joal: np, we are going to do ops-sync now, standup is in 45 mins
[15:14:56] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4050070 (10Ottomata)
[15:16:35] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#3594088 (10Ottomata)
[15:17:06] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Migrate mjolnir Kafka clients to use Kafka jumbo - https://phabricator.wikimedia.org/T188408#4006750 (10Ottomata) Cool, this is done?!
[15:17:17] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4050081 (10Ottomata)
[15:27:21] !log bouncing main -> jumbo mirror maker instances
[15:27:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:37:26] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Check detached accounts in DB with same username for "mediawiki" and "phab" sources but different uuid's (and merge if connected) - https://phabricator.wikimedia.org/T170091#4050175 (10Aklapper) 05stalled>03declined The number of suc...
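For the offset point made above (a consumer group has no committed offsets on the new cluster the first time it connects): a minimal kafka-python sketch, with the broker and group names as placeholders, showing where auto_offset_reset comes into play for a webperf-style consumer of eventlogging-client-side.

    from kafka import KafkaConsumer  # kafka-python

    # With no committed offsets on the jumbo cluster yet, auto_offset_reset decides
    # the starting point: "latest" skips anything already mirrored, "earliest" replays it.
    consumer = KafkaConsumer(
        "eventlogging-client-side",
        bootstrap_servers=["kafka-jumbo1001.eqiad.wmnet:9092"],  # placeholder broker
        group_id="example-consumer-group",                       # placeholder group
        auto_offset_reset="latest",
    )
    for message in consumer:
        print(message.offset, message.value[:80])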
[15:40:21] 10Analytics: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4050193 (10mforns)
[15:43:39] 10Analytics: [EL sanitization] Ensure presence of EL YAML whitelist in analytics1003 - https://phabricator.wikimedia.org/T189691#4050215 (10mforns)
[15:46:16] 10Analytics: [EL sanitization] Modify mysql purging script to read from the new YAML whitelist - https://phabricator.wikimedia.org/T189692#4050230 (10mforns)
[15:58:13] (03PS6) 10Fdans: Responsive Wikistats 2 UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999
[16:01:12] ping joal fdans
[16:04:20] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Sanitize Hive EventLogging - https://phabricator.wikimedia.org/T181064#4050284 (10mforns)
[16:05:45] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4050287 (10Ottomata)
[16:05:59] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#3849554 (10Ottomata)
[16:25:17] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Migrate mjolnir Kafka clients to use Kafka jumbo - https://phabricator.wikimedia.org/T188408#4050410 (10dcausse) 05Open>03Resolved a:03dcausse yes!
[16:25:19] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4050413 (10dcausse)
[16:30:21] 10Analytics-Kanban: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4050431 (10mforns)
[16:31:16] 10Analytics-Kanban: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4050193 (10mforns) a:03mforns
[16:33:44] ottomata: mm seems stable now
[16:33:49] let's see how it goes
[16:34:01] ya
[16:34:22] might puppetize that blacklist change
[16:34:32] if this makes things stable, we can wait for job topics until we upgrade main
[16:40:52] heya team - sorry couldn't make it earlier :(
[16:45:04] np!
[17:10:52] elukey: https://gerrit.wikimedia.org/r/#/c/419482/
[17:11:02] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4050641 (10Ottomata)
[17:11:10] lunch, bbl
[17:14:02] ah yes!
[17:40:44] ottomata: I almost ported the eventlogging dashboards but there are some bits missing
[17:40:59] if I don't manage to do them today I'll finish tomorrow morning
[17:41:02] is it ok?
[17:41:46] oh ya no problem no hurry at all
[17:46:48] https://grafana-admin.wikimedia.org/dashboard/db/eventlogging-jumbo - didn't do the schema percentage but the others work (more or less)
[17:48:35] all right logging off
[17:48:44] see you tomorrow people :)
[17:50:39] see yaaaaa
[17:50:42] thanks elukey!
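As an illustration of what the "partial purging of nested fields" behind T189692 could look like when applied to an event, here is a purely hypothetical sketch (not the actual purging script, and the whitelist structure shown is an assumption): whitelisted fields are kept, everything else is dropped, recursing into nested objects.

    import yaml  # PyYAML

    # Hypothetical whitelist: keep only the listed (possibly nested) fields of a schema.
    whitelist = yaml.safe_load("""
    ExampleSchema:
      event:
        action: keep
        page_id: keep
      webHost: keep
    """)

    def apply_whitelist(record, allowed):
        """Return a copy of record keeping only whitelisted fields, recursing into dicts."""
        kept = {}
        for field, rule in allowed.items():
            if field not in record:
                continue
            if isinstance(rule, dict) and isinstance(record[field], dict):
                kept[field] = apply_whitelist(record[field], rule)
            else:
                kept[field] = record[field]
        return kept

    event = {"event": {"action": "click", "page_id": 42, "user_text": "Alice"},
             "webHost": "en.wikipedia.org", "ip": "198.51.100.1"}
    print(apply_whitelist(event, whitelist["ExampleSchema"]))  # user_text and ip are dropped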
[17:53:42] 10Analytics, 10Analytics-Cluster: Migrate eventbus camus to Kafka jumbo - https://phabricator.wikimedia.org/T189713#4050809 (10Ottomata) p:05Triage>03Normal
[17:53:55] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4050824 (10Ottomata)
[17:56:08] 10Analytics, 10Analytics-Cluster: Migrate EventStreams to Kafka Jumbo - https://phabricator.wikimedia.org/T189716#4050851 (10Ottomata) p:05Triage>03Normal
[17:56:19] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4050867 (10Ottomata)
[17:56:34] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#3594088 (10Ottomata)
[18:16:59] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#4051001 (10phuedx) >>! In T187277#3973952, @Ottomata wrote: >> Given that EventLogging collects neither unique identifiers...
[18:19:16] hi lzia :] can you confirm that this patch is good: https://gerrit.wikimedia.org/r/#/c/405727/
[18:19:33] * lzia checks the patch for mforns
[18:19:48] lzia, I take responsibility for its functionality, just wanted to be sure those fields are the ones we talked about :]
[18:19:53] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Migrate eventbus camus to Kafka jumbo - https://phabricator.wikimedia.org/T189713#4051007 (10Ottomata)
[18:20:18] mforns: makes sense. give me a couple of min, please
[18:20:30] sure!
[18:22:29] mforns: looks good to me.
[18:24:14] lzia, ok, thanks!
[18:26:35] elukey, can you merge this whitelist patch please? https://gerrit.wikimedia.org/r/#/c/405727/ I just wrote it and lzia agreed, for more details: https://phabricator.wikimedia.org/T174386#4036065
[18:26:49] elukey, merge if LGTY, of course
[18:27:10] oh... elukey is gone, hehehe
[18:27:22] ottomata? :D ^
[18:27:26] can you do that, please?
[18:28:28] done
[18:30:05] :]
[18:32:55] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4051086 (10Ottomata)
[18:53:28] 10Analytics-Kanban, 10Operations, 10ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4051156 (10Cmjohnson) No errors again today
[19:00:24] (03PS1) 10Ottomata: Scripts to build jupyterhub based SWAP [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419507 (https://phabricator.wikimedia.org/T183145)
[19:01:13] (03CR) 10Ottomata: [V: 032 C: 032] Scripts to build jupyterhub based SWAP [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419507 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata)
[19:01:35] madhuvishy: do you use the jupyterhub puppet module in cloud vps?
[19:01:39] or is that just for swap?
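Relatedly, a hedged sketch of the JupyterHub side of what is described just below this point in the log (dropping the nginx/lua proxy in favor of the bundled nodejs configurable-http-proxy, and serving JupyterLab). The paths are placeholders and this is not the actual SWAP configuration.

    # jupyterhub_config.py -- illustrative only, not the real SWAP deployment config
    c = get_config()  # noqa: F821 (injected by JupyterHub when the config file is loaded)

    # Use the stock nodejs proxy shipped with the deploy repo instead of nginx+lua.
    c.JupyterHub.proxy_class = 'jupyterhub.proxy.ConfigurableHTTPProxy'
    c.ConfigurableHTTPProxy.command = [
        '/srv/deployment/swap/node_modules/.bin/configurable-http-proxy'  # placeholder path
    ]

    # Serve JupyterLab by default for spawned single-user servers.
    c.Spawner.default_url = '/lab'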
[19:01:51] ottomata: only for swap
[19:01:53] ok great
[19:02:08] yeah it's all yours
[19:05:05] (03PS1) 10Ottomata: Add artifacts for initial build of swap [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419509 (https://phabricator.wikimedia.org/T183145)
[19:05:36] (03CR) 10Ottomata: [V: 032 C: 032] Add artifacts for initial build of swap [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419509 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata)
[19:08:07] ottomata: also I'm pretty sure that nginx-extras was for the lua stuff in the proxy
[19:08:27] 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4051220 (10Cmjohnson) Removing ops-eqiad project tag
[19:08:32] owing to yuvi's love of proxies and lua
[19:08:36] yeah madhuvishy i think i'm going to drop that
[19:08:39] cool
[19:08:41] go for it
[19:08:45] i can package the nodejs based one with the deploy repo
[19:08:48] and run it
[19:08:53] without extra config
[19:08:54] i think*
[19:08:59] cool
[19:09:05] https://gerrit.wikimedia.org/r/#/c/419507/
[19:09:20] madhuvishy: i'm also going to include jupyterlab because i can and it's easy and why not! :o
[19:09:28] cool!
[19:09:55] there's some extension that makes jupyterlab work better with jupyterhub (like adds some login/logout menus? that seems weird to package, because it needs to be installed via jupyter extension installer or something? not doing that for now)
[19:15:33] (03PS1) 10Ottomata: Rename wheels_dir -> wheels [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419510 (https://phabricator.wikimedia.org/T183145)
[19:15:41] (03CR) 10Ottomata: [V: 032 C: 032] Rename wheels_dir -> wheels [analytics/swap/deploy] - 10https://gerrit.wikimedia.org/r/419510 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata)
[19:30:17] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#4051320 (10Nuria) @phuedx not easily. We implement policies "per-data-type" in this case "schema-bound-data" eventlogging...
[19:56:38] headed out for a bit team, will work a little bit later
[19:59:51] 10Analytics-Kanban: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4051389 (10mforns) I wrote the script that translates the old TSV whitelist into the new YAML format. I did not push the code to any repo, because this is a one-off. Here's the code:...
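The one-off script mforns attached is truncated in this log. Purely as an illustration of the general idea (and assuming a hypothetical schema<TAB>field TSV layout, with dots marking nested fields), a translation sketch could look like this:

    import sys
    import yaml  # PyYAML

    def tsv_to_yaml(tsv_path):
        """Group a flat schema<TAB>field TSV into a nested dict, splitting field names on '.'."""
        whitelist = {}
        with open(tsv_path) as f:
            for line in f:
                if not line.strip():
                    continue
                schema, field = line.rstrip("\n").split("\t")
                node = whitelist.setdefault(schema, {})
                parts = field.split(".")
                for part in parts[:-1]:  # walk/create nested levels, e.g. event.page_id
                    node = node.setdefault(part, {})
                node[parts[-1]] = "keep"
        return yaml.safe_dump(whitelist, default_flow_style=False)

    if __name__ == "__main__":
        print(tsv_to_yaml(sys.argv[1]))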
[20:03:05] 10Analytics-Kanban: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4051392 (10mforns) The resulting new EL whitelist in YAML format, that supports partial purging of nested fields, is: {F15424632}
[20:03:54] 10Analytics-Kanban: [EL Sanitization] Translate TSV whitelist into new YAML whitelist - https://phabricator.wikimedia.org/T189690#4051394 (10mforns)
[20:03:56] 10Analytics-Kanban, 10Patch-For-Review: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4051393 (10Nuria) {F15424955}
[20:06:09] (03PS9) 10Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507)
[20:06:11] (03PS1) 10Joal: [WIP] Update mediawiki-history spark job for performance [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419516
[20:13:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Update mediawiki-history spark job for performance [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419516 (owner: 10Joal)
[20:31:22] joal: yt?
[20:31:25] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (designing): Support per-db-shard concurrency in ChangeProp - https://phabricator.wikimedia.org/T189738#4051504 (10Pchelolo) p:05Triage>03High
[20:38:20] 10Analytics: unique devices data for january not in cassandra - https://phabricator.wikimedia.org/T189740#4051542 (10Nuria)
[20:53:29] 10Analytics: unique devices data for january not in cassandra - https://phabricator.wikimedia.org/T189740#4051630 (10Nuria) cassandra@cqlsh> select * from "local_group_default_T_unique_devices".data where timestamp = '20180201' and project='es.wikipedia' and granularity='monthly' and "_domain"='analytics.wikimed...
[20:58:43] 10Analytics, 10ChangeProp, 10EventBus, 10Services (later): RESTBase content rerenders sometimes don't pick up the newest changes - https://phabricator.wikimedia.org/T176412#4051660 (10mobrovac)
[20:59:09] (03CR) 10Nuria: "Let's please add a phab ticket . Also, these changes seem quite substantial , can we run job for 1 wiki and compare results to current sna" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419516 (owner: 10Joal)
[21:12:16] milimetric: hi. question for you. (I'm pretty sure the answer is "we don't have it" but just to make sure:) Do we have hourly pageview data for Chinese Wikipedia pages from mainland China and from May 2015?
[21:12:53] milimetric: sampled or unsampled can both be relevant.
[22:54:49] lzia: milimetric is off this week
[22:55:09] nuria_: thanks for letting me know.
[22:55:16] nuria_: any chance you know the answer?
[22:55:20] lzia: pageview data in hadoop starts in june 2015
[22:55:42] lzia: so we have that data for june but not may
[22:56:06] lzia: actually 1st full month i think is july
[22:56:32] nuria_: do we have the data aggregated by hour, geolocation, and Wikipedia page_id?
[22:56:41] lzia: yes
[22:57:05] nuria_: got you. and before that point in time, didn't we collect sampled data?
[22:58:37] lzia: before that we have project counts and article counts, not sampled, but neither has a geographical dimension
[22:59:22] right. thanks, nuria_. the timing is unfortunate, cuz the study focused on the changes around China's block of Wikipedia in May 2015.
[23:00:04] lzia: there are several studies that do that using article pageviews as a proxy
[23:01:44] nuria_: yeah. I'm talking with the people behind http://jenpan.com/jen_pan/censored.pdf and http://jenpan.com/jen_pan/experiment.pdf They were wondering if they can deepen the knowledge in this space by digging into the data more. [23:01:50] too bad we don't have the data. [23:02:53] thank you, nuria_.