[00:27:21] (CR) Ori.livneh: [V: 2] Allow to specify date to compute the report for [analytics/blog] - https://gerrit.wikimedia.org/r/180158 (owner: QChris) [00:27:31] (CR) Ori.livneh: [V: 2] Add basic tox setup [analytics/blog] - https://gerrit.wikimedia.org/r/180159 (owner: QChris) [00:27:41] (CR) Ori.livneh: [C: 2 V: 2] Move blogreport code behind a __main__ guard [analytics/blog] - https://gerrit.wikimedia.org/r/180160 (owner: QChris) [00:27:50] (CR) Ori.livneh: [C: 2 V: 2] Add tests for parsing string to date [analytics/blog] - https://gerrit.wikimedia.org/r/180161 (owner: QChris) [00:40:12] (PS1) Ori.livneh: Add overall counts for URLs [analytics/blog] - https://gerrit.wikimedia.org/r/180369 [01:13:38] leila, yt? [01:14:16] nuria__: yes. [01:14:24] nuria__: what's up? [01:15:00] i am trying to productionize Ironholds' code to calculate app uniques from the webrequest table in hive [01:15:34] leila: for perf reasons i would like to avoid scanning 1 month of data if possible [01:16:23] leila: so i want to do some random sampling that will render data with an appropriate level of confidence [01:16:32] leila: so far makes sense? [01:17:11] question: we want the number of uniques per month? [01:17:27] leila: yes [01:17:42] leila: so it is a 'counting' problem [01:17:52] leila: no more complex than that [01:17:54] you want to do this in hive or not necessarily? [01:18:35] leila: we will do it with hive random sampling but first i wanted to ask if the approximation of [01:19:25] sample size like simple: http://en.wikipedia.org/wiki/Binomial_distribution#Confidence_intervals [01:19:56] (I'm trying to understand if approximation is needed. it may be needed if you stick to hive. I'm wondering if you consider other options like pig or Oozie at the moment.
[01:19:58] sorry, a simple calculation [01:20:47] this will run with oozie, but my concern is not to have to look at 1 month of data if it is not needed, say, if i can calculate [01:20:59] I understand [01:21:16] the number with a confidence of 95% looking at -let's say- 60% of the dataset [01:21:33] the query consumes less time & resources [01:21:36] I understand. please continue. [01:22:11] so trying to remember my stats from school i was trying to estimate the sample size for a "counting" problem [01:22:36] and i think [01:23:51] the formula was pretty simple (if you had a good guess as to the true value, which we do) [01:24:10] https://www.irccloud.com/pastebin/CXJ5XBF1 [01:24:35] i just cut and pasted, hopefully this makes sense [01:24:57] nuria__: there are multiple ways for doing this. I need to think about it for some time and let you know. Would it be good if I let you know before eod today? [01:25:17] leila: no rush at all, tomorrow is just as good [01:25:52] sounds good. I'll email you then, hopefully by eod. [01:26:02] leila: Many thanks!!! [01:26:08] np, nuria__. [01:28:54] nuria__: how long is the unique ID? [01:29:12] per record in webrequest table? [01:29:16] taht one? [01:29:16] yup [01:29:18] *that [01:30:12] Analytics-Visualization, Analytics-Engineering: PM shares a deep link into Limn Dashboard - https://phabricator.wikimedia.org/T78743#852299 (kevinator) NEW [01:31:01] boy, i imagine that our id is the partition key year/month/day/hour but this otto will know [01:31:19] no worries. I'll look at it. [01:31:26] nuria__: I'm signing off. will email later. [01:31:30] ciao [01:31:36] nuria__: no rush please ciao [03:47:34] Phabricator, Engineering-Community, Analytics-Tech-community-metrics: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#852435 (Aklapper) >>! In T1003#851771, @chasemp wrote: > can I get on this list? You need to bribe mutante. >>! In T1003#851812, @Qgil wrote: > I...
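The pastebin nuria links above isn't preserved in this log, so this is a hedged reconstruction of the kind of calculation being discussed: the standard normal-approximation sample-size formula for a binomial "counting" problem (the one the Wikipedia link describes). Function name and defaults are illustrative, not from the original paste:

```python
import math

def sample_size(p_guess, margin, z=1.96, population=None):
    """Sample size for estimating a proportion via the normal
    approximation to the binomial: n = z^2 * p * (1 - p) / e^2.
    `p_guess` is the prior guess at the true proportion, `margin`
    the desired margin of error, `z` the z-score for the confidence
    level (1.96 ~ 95%). Applies the finite-population correction
    when `population` is given."""
    n = z * z * p_guess * (1.0 - p_guess) / (margin * margin)
    if population is not None:
        n = n / (1.0 + (n - 1.0) / population)
    return math.ceil(n)

# Worst-case guess (p = 0.5) at 95% confidence, 5% margin of error:
print(sample_size(0.5, 0.05))  # 385
```

With populations in the billions of rows, as discussed later in this log, the finite-population correction is negligible; the point is that the required sample size depends on the margin of error, not the dataset size.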
[08:05:47] Analytics-Tech-community-metrics, Phabricator: Metrics for key Wikimedia projects software in Maniphest - https://phabricator.wikimedia.org/T28#852651 (Qgil) [09:16:24] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Create MediaViewer image varnish hit/miss ratio dashboard - https://phabricator.wikimedia.org/T78205#852788 (Gilles) >>! In T78205#851804, @Tgr wrote: > Yeah, I didn't think of that. The Last-Modified header of thumbnails seems match when they were... [09:18:00] Analytics-EventLogging: EventLogging ValidateSchemaTest::testValidEvent() fails under HHVM - https://phabricator.wikimedia.org/T78680#852791 (hashar) Any idea why it would fail under HHVM and not under Zend ? One sure thing, the test pass now: https://integration.wikimedia.org/ci/job/mwext-EventLogging-teste... [09:27:55] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Create MediaViewer image varnish hit/miss ratio dashboard - https://phabricator.wikimedia.org/T78205#852821 (Gilles) Actually I see that there's a way to tell this only with headers, no need to calculate the local time difference. The "Age" header... [09:30:25] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified and Date to performance logging - https://phabricator.wikimedia.org/T78767#852822 (Gilles) NEW a:Gilles [09:30:41] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Create MediaViewer image varnish hit/miss ratio dashboard - https://phabricator.wikimedia.org/T78205#852835 (Gilles) T78767 [09:34:22] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified and Date to performance logging - https://phabricator.wikimedia.org/T78767#852842 (Gilles) Actually, as @tgr pointed out, Varnish's X-Timestamp is the same as Last-Modified, and we're already logging that. Assuming that the clocks... 
[09:46:02] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified and Date to performance logging - https://phabricator.wikimedia.org/T78767#852853 (Gilles) Actually timestamp != Date for one very obvious reason: the EL event will only be recorded after the image load, and will depend on latency... [10:15:47] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified and Date to performance logging - https://phabricator.wikimedia.org/T78767#852879 (Gilles) Saving this for later: P163 [10:26:13] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified and Date to performance logging - https://phabricator.wikimedia.org/T78767#852888 (Gilles) Ah, it turns out that the "timestamp" column IS the Date header. So we only need Last-Modified. [10:30:58] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified to performance logging - https://phabricator.wikimedia.org/T78767#852904 (Gilles) [10:31:32] MediaWiki-extensions-MultimediaViewer, Multimedia, Analytics: Add Last-Modified to performance logging - https://phabricator.wikimedia.org/T78767#852822 (Gilles) [11:11:50] Analytics-Tech-community-metrics, Engineering-Community, Phabricator: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#852957 (Dzahn) ``` Hi Community Metrics team, this is your automatic monthly Phabricator statistics mail. Number of active users (any activity) i... [11:14:52] Analytics-Tech-community-metrics, Engineering-Community, Phabricator: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#852958 (Dzahn) >>! In T1003#852435, @Aklapper wrote: >>>! In T1003#851771, @chasemp wrote: >> can I get on this list? > > You need to bribe mutant... 
[11:30:34] Analytics, MediaWiki-extensions-MultimediaViewer, Multimedia: Investigate if pre-rendering images is having an impact on performance - https://phabricator.wikimedia.org/T76035#852986 (fgiunchedi) yep clients are geo-located to the closest datacenter via dns, so different cp machines get very different clients... [11:31:05] Analytics-Tech-community-metrics, Engineering-Community, Phabricator: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#852987 (Dzahn) >>! In T1003#852435, @Aklapper wrote: > but using $sql_name in the script won't work here as that is on the "phabricator_user" DB in... [11:34:31] Analytics-Tech-community-metrics, Engineering-Community, Phabricator: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#852990 (Dzahn) >>>! In T1003#851249, @Aklapper wrote: >> Uh, //daily//? The idea was monthly (see topic), otherwise the queries using INTERVAL 1 MO... [11:39:41] Analytics-Tech-community-metrics, Engineering-Community, Phabricator: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#853006 (Dzahn) [11:39:42] Analytics-Tech-community-metrics, Phabricator: SQL user/grant for phabricator statistics script - https://phabricator.wikimedia.org/T78311#853004 (Dzahn) Resolved>Open re-opening, because now we have an additional requirement. In T1003 we have been asked to add another query but it access a different da... [12:00:36] Analytics-Engineering: WebStatsCollector's pageviews definition should have a UDF - https://phabricator.wikimedia.org/T78779#853038 (Ironholds) NEW a:Ironholds [12:15:09] qchris, sorry didn't answer yesterday in time. Re "text" - yes, we plan to analyze future desktop traffic from carriers, once we start tagging it in varnish [12:15:40] yurikR: no worries. I'll respond to the email today. 
[12:16:00] qchris, re who needs it - Dan & the rest of zero team needs it so they can officially start using our graphs instead of limn [12:16:26] and so that in case we are asked about numbers, we can say that you at least looked over our procedure and found it reasonable [12:37:11] Analytics-General-or-Unknown: datasets.wikimedia.org SSL error - https://phabricator.wikimedia.org/T74805#853086 (QChris) Open>Resolved a:QChris ottomata moved stat1001 behind misc-web. Now SSL is handled before the request gets to stat1001, and the issue is gone. [13:28:36] (CR) QChris: [C: -1] "Only Nits." (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/180305 (owner: Ottomata) [13:51:34] (CR) Hashar: "recheck" [analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [13:51:42] (CR) jenkins-bot: [V: -1] Add overall counts for URLs [analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [13:52:17] bah qchris :-( [13:52:29] analytics/blog tox jobs are failing [13:52:32] Ouch :-( [13:52:37] They passed locally. [13:52:43] * qchris checks again [13:53:25] might just be the patch https://gerrit.wikimedia.org/r/#/c/180369/ [13:54:20] (PS1) QChris: [DO NOT SUBMIT] Empty test for CI [analytics/blog] - https://gerrit.wikimedia.org/r/180472 [13:54:26] (CR) jenkins-bot: [V: -1] [DO NOT SUBMIT] Empty test for CI [analytics/blog] - https://gerrit.wikimedia.org/r/180472 (owner: QChris) [13:54:46] That one succeeds for me locally :-) [13:54:49] (PS2) Hashar: Add overall counts for URLs [analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [13:54:54] (CR) jenkins-bot: [V: -1] Add overall counts for URLs [analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [13:55:21] (CR) Hashar: "I have fixed the flake8 errors. The other tox env failing is being investigated."
[analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [13:55:54] sh: 1: mysql_config: not found [13:55:54] hashar: https://integration.wikimedia.org/ci/job/analytics-blog-tox-py27/2/console [13:55:59] Yes :-/ [13:56:06] seems building sqlalchemy requires the mysql-dev package or something [13:56:18] Mhmm. [13:57:06] I remember I had to install mysql on my laptop to be able to build sqlalchemy [13:57:29] I guess that's nothing tox can cater for :-( [13:57:39] libmysqlclient-dev would do [13:57:46] just have to install it on the slaves [13:57:57] one day folks will be able to add in their test sudo apt-get install libmysqlclient-dev [13:58:00] Would that be ok to install? [13:58:03] meanwhile that has to be done in puppet [13:58:15] yeah doing so [13:58:21] we already have a bunch of -dev packages [13:58:25] Ok. Awesome! :-) [13:58:32] * hashar pesters at slow internet connection [13:58:59] ideally we would have an up-to-date python-sqlalchemy package installed which tox would use instead of compiling from pip [14:06:15] qchris: fixed :] [14:06:25] (CR) Hashar: "recheck" [analytics/blog] - https://gerrit.wikimedia.org/r/180472 (owner: QChris) [14:06:26] Awesome! [14:07:34] It's green! [14:07:35] \o/ [14:07:42] Thanks hashar! [14:07:42] (Abandoned) Hashar: [DO NOT SUBMIT] Empty test for CI [analytics/blog] - https://gerrit.wikimedia.org/r/180472 (owner: QChris) [14:07:53] (CR) Hashar: "The CI slaves were lacking libmysqlclient-dev which I have installed via puppet." 
[analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [14:07:59] (CR) Hashar: "recheck" [analytics/blog] - https://gerrit.wikimedia.org/r/180369 (owner: Ori.livneh) [14:47:31] (CR) QChris: [C: -1] [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (25 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [14:52:21] yay, christian comments :D [14:52:40] :-P [14:58:54] (CR) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [15:14:27] (PS2) Ottomata: Catch ValueError raised by hive.partition_datetime_from_path [analytics/refinery] - https://gerrit.wikimedia.org/r/180305 [15:19:41] Analytics, MediaWiki-extensions-MultimediaViewer, Multimedia: Investigate if pre-rendering images is having an impact on performance - https://phabricator.wikimedia.org/T76035#853383 (Gilles) >>! In T76035#852986, @fgiunchedi wrote: > Disabling prerending and running the measurements again sounds easier to te... [15:21:43] Analytics, MediaWiki-extensions-MultimediaViewer, Multimedia: Investigate if pre-rendering images is having an impact on performance - https://phabricator.wikimedia.org/T76035#853385 (Gilles) Which regions do the cp4xxx servers cover, out of curiosity? [15:27:55] Analytics, MediaWiki-extensions-MultimediaViewer, Multimedia: Investigate if pre-rendering images is having an impact on performance - https://phabricator.wikimedia.org/T76035#853390 (Gilles) Nevermind, I answered my own question by looking at the data: P164 It seems to be predominantly Asia, with Taiwan taki... [15:34:00] Ironholds: yt? 
[15:34:19] nuria__, I deny everything [15:34:27] Ironholds: jajaja [15:35:07] Ironholds: i just have a question: the parse_url() in the query: https://gist.github.com/Ironholds/428014d22edb7969ff5c [15:35:41] yup? [15:35:43] it's one of the apache udfs right? you did not have [15:35:58] yeah, it's an inbuilt hive UDF [15:36:00] another jar for that udf [15:36:05] ok, thank youu [15:36:11] it requires an actual URL body, hence the CONCAT, but is otherwise mostly sane [15:36:22] amusingly I ended up re-implementing the UDF in C++ for another project :D [15:37:17] Ironholds: ok, talked to leila a bit yesterday about the sampling to see if we can use one of the simple formulas for sampling sizes of "counting" problems so we can reduce a bit the sweep through the table [15:37:27] gotcha [15:37:31] Ironholds: but thus far running the query i just get OOMs [15:37:43] oh, yes. The wonders of the hive client :D [15:37:43] nuria__: I'm on it. comparing a couple of different methods [15:37:50] export HADOOP_HEAPSIZE=1024 before launching hive [15:38:09] dep_hive_query automatically does this before any query, because I run into the problem so often, even with simple stuff :/ [15:38:28] ottomata, while we're talking hardware, did I hear something about new machines for something-or-other in the dev sync-up, or was I mishearing? [15:39:12] Ironholds: do you have a query troubleshooting wiki or should i start one? [15:39:12] that was new namenodes, to replace the ciscos [15:39:14] no new capacity [15:39:40] nuria__: Ironholds, I am worried about OOM errors. if they happen on the client side, not such a big deal, because you can increase HADOOP_HEAPSIZE as oliver notes. [15:39:48] if they happen on the hive server side...I am not yet sure [15:40:04] ottomata: agreed, that is why i was trying to look at the logs [15:40:52] nuria__: are you getting an application_id? i.e. your job is actually launching in hadoop? [15:41:24] ottomata: no, it was not [15:41:31] nuria__, please do start one!
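Ironholds' remark that parse_url() "requires an actual URL body, hence the CONCAT" can be illustrated outside Hive: URL parsers generally only recognise a host once a scheme is present, which is why the query glues a scheme onto uri_host before parsing. A rough Python analogue, with urllib.parse standing in for Hive's built-in parse_url:

```python
from urllib.parse import urlparse

# Without a scheme the host is not recognised -- everything lands in path:
bare = urlparse("en.wikipedia.org/wiki/Main_Page")
print(bare.netloc, "|", bare.path)   # '' | en.wikipedia.org/wiki/Main_Page

# Prepending a scheme (what the CONCAT in the Hive query does) fixes it:
full = urlparse("http://" + "en.wikipedia.org" + "/wiki/Main_Page")
print(full.netloc, "|", full.path)   # en.wikipedia.org | /wiki/Main_Page
```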
[15:41:36] ooh, that's WEIRD. [15:41:42] wait. nuria__ how are you running it? [15:42:05] Ironholds: from command line: time hive -f select-app-uniques.sql >& output.txt [15:42:12] well, there goes my theory [15:42:21] so, this has been happening more and more to me recently, too. I don't know why. [15:42:23] nuria, can you tell if it is accepted by the hive server? [15:42:27] Ironholds: that is the problem we were having with your pageviews query before we turned it into a udf...if [15:42:28] yeah [15:42:28] As of a couple of days ago, there are massive lags. [15:42:32] yeah [15:42:43] but it also happened to a totally unrelated query that was for legal [15:42:57] it stops just before it assigns an appID and predicts reducer counts. [15:42:59] and just...freezes. [15:44:54] Ironholds: did that query ever run? [15:44:59] i remember you left it at the end of the day [15:45:55] it did, for some reason, work when I ran it through R. [15:46:08] even though that's just a system call to hive -f temp_query_file.hql > temp_output_file.tsv [15:46:18] ottomata: i think the task is accepted by the hive server : [15:46:21] https://www.irccloud.com/pastebin/UUQALaxF [15:47:51] nuria__: how large do you expect the number of uniques in a month to be? [15:51:02] leila: we know it will be (for all apps) about 9 million as oliver gathered that data earlier [15:51:25] I see. thanks nuria__. [15:51:25] leila: the whole dataset is about 2500 * 10^6 for 1 month [15:51:53] Ironholds: troubleshooting doc started: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Troubleshooting [15:52:10] ta! [15:52:16] ah yes [15:52:20] that is the same problem Ironholds has [15:52:25] so yes, your job is accepted by the hive server [15:52:30] but it is not getting launched in hadoop [15:52:35] notice that it says 999 reducers. [15:52:37] that is not good [15:52:42] nuria__: what is your query? [15:53:16] nuria__: wanna get on the call?
[15:54:16] ottomata: the one i got from Ironholds : https://gist.github.com/Ironholds/428014d22edb7969ff5c [15:55:29] leila: give me a sec to see if we can figure out the query issues [15:55:43] sure. just ping me when you're ready nuria__ [15:55:57] leila: thank you! [16:26:52] ottomata: looks like incrementing the heapsize on the client makes hive able to send the task to hadoop [16:26:56] ottomata, JFYI I'm working on patching for qchris's comments on that patch, btw [16:26:57] Analytics-Engineering, Analytics-Wikimetrics: Re-run Wikimetrics data once Labs issues are fixed [8 pts] - https://phabricator.wikimedia.org/T78305#853495 (Milimetric) [16:27:04] ...that came out all unreadable. [16:27:36] Analytics-Engineering, Analytics-Wikimetrics: Re-run Wikimetrics data once Labs issues are fixed [8 pts] - https://phabricator.wikimedia.org/T78305#841895 (Milimetric) The old recurrent reports are saved by the backup system, but I put just the contents of the datafiles directory here: /data/project/wikimetri... [16:27:41] hulk patch java! hulk-mata have better things to do than patch java. If hulk-mata patch java, hulk effort's duplicated and hulk sad. [16:35:40] Analytics-Refinery: Getting Ananth started - https://phabricator.wikimedia.org/T77196#853510 (kevinator) [16:35:52] actually, hulk not know how to patch some bits. [16:42:28] Analytics-Refinery: Hive User can specify webrequest date range in query more easily - https://phabricator.wikimedia.org/T76531#853519 (Ottomata) [16:42:49] (PS6) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [16:44:54] (PS1) Gilles: Add scroll metadata open/close events to dashboards [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/180501 [16:46:01] nuria__: in the simple case, are you considering tablesample in hive? 
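The HADOOP_HEAPSIZE workaround that unblocked nuria's query can be wrapped the way Oliver describes dep_hive_query doing it: export the variable before every `hive -f` invocation. A hypothetical sketch (helper names are invented; only the environment variable and the hive command line come from the discussion above):

```python
import os
import subprocess

def hive_command(query_file, heapsize_mb=1024):
    """Return (argv, env) for running a HiveQL file with a larger
    client-side JVM heap -- equivalent to
    `export HADOOP_HEAPSIZE=1024; hive -f query_file`."""
    env = dict(os.environ, HADOOP_HEAPSIZE=str(heapsize_mb))
    return ["hive", "-f", query_file], env

def run_hive_file(query_file, output_file, heapsize_mb=1024):
    """Run the query, capturing stdout/stderr to a file, so the hive
    client does not OOM before the job is even submitted to hadoop."""
    argv, env = hive_command(query_file, heapsize_mb)
    with open(output_file, "w") as out:
        subprocess.run(argv, env=env, stdout=out,
                       stderr=subprocess.STDOUT, check=True)
```

This only addresses client-side OOMs; as ottomata notes above, server-side memory pressure is a separate question.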
[16:46:29] darnit [16:46:35] how do I reply to comments left on an old commit? [16:55:37] Analytics-Engineering: Oozie tutorial - https://phabricator.wikimedia.org/T78687#853536 (kevinator) a:kevinator [16:58:23] Analytics-Engineering: Oozie tutorial - https://phabricator.wikimedia.org/T78687#853538 (ggellerman) Andrew offered to do casual question & answer session. If that doesn't cover it, we can schedule a more formal presentation (possibly when everyone is in SF for WMDS) [16:58:24] interesting, nuria__, so you got your query started? [16:58:25] did it finish? [16:58:59] whoa, there are no partitions specified on that query? [17:01:36] Analytics, Analytics-Engineering: Analytics User uses CentralNotice cookie in x-analytics field of web-request logs - https://phabricator.wikimedia.org/T75835#853539 (kevinator) stalled>declined a:kevinator Closing task since it's not needed anymore. A workaround was implemented. [17:02:21] Analytics-Engineering, Analytics-Wikimetrics: Re-run Wikimetrics data once Labs issues are fixed [8 pts] - https://phabricator.wikimedia.org/T78305#853542 (Milimetric) [17:02:49] ottomata: no, there are not , oliver used to run it for a 'month' [17:04:33] ottomata: thus my conversation with leila about random sampling and reducing dataset a bit [17:04:49] aye [17:05:30] I mean, partitions are specified for webrequest_source [17:05:38] and no, I ran it for 31 days. 
Months are meaningless :) [17:06:16] sorry, 31 days [17:09:13] Analytics, Analytics-Engineering: Engineer adds data to X-Analytics header using mediawiki extension - https://phabricator.wikimedia.org/T78801#853554 (kevinator) NEW [17:09:31] Ironholds: at the time of running it, it will sweep all partitions on the cluster, which might amount to more than 31 days of data, that is why i was saying "month", but understood [17:09:45] yep [17:09:47] which is a problem :/ [17:15:50] Analytics-Refinery: Hive User can specify webrequest date range in query more easily - https://phabricator.wikimedia.org/T76531#853589 (kevinator) A little background: * a UDF was deemed the wrong solution to the problem because Hive would have had to send all the partitions to the UDF to determine if th... [17:19:59] Analytics-Refinery: Hive User calls UDF to extract fields out of X-Analytics header - https://phabricator.wikimedia.org/T78805#853590 (kevinator) NEW [17:26:22] Analytics: Move stat1001, stat1002 and stat1003 into Analytics VLAN - https://phabricator.wikimedia.org/T76346#853659 (Ottomata) @bblack I've started an etherpad to guide us for tomorrow's move: http://etherpad.wikimedia.org/p/stat-analytics-vlan. [17:32:10] (CR) QChris: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [17:39:52] Analytics-Refinery: Hive User calls UDF to pull real requests or IP out of X-Forwarded-For header - https://phabricator.wikimedia.org/T78812#853687 (kevinator) NEW [17:43:55] Analytics-Refinery: Hive User calls UDF to pull real requestor IP out of X-Forwarded-For header - https://phabricator.wikimedia.org/T78812#853696 (Ottomata) [17:52:02] ottomata, question if you may: [17:52:15] yush?
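One way to keep a query from sweeping every partition, given the year/month/day/hour partition layout mentioned earlier in this log, is to generate an explicit partition predicate for the wanted date range. A hypothetical helper, sketched in Python (not the refinery's actual solution, which T76531 above was still designing):

```python
from datetime import datetime, timedelta

def partition_predicate(start, end):
    """Build a Hive WHERE fragment over (year, month, day) partition
    columns covering the half-open range [start, end), so partition
    pruning limits the scan instead of sweeping the whole table."""
    days, d = [], start
    while d < end:
        days.append("(year=%d AND month=%d AND day=%d)"
                    % (d.year, d.month, d.day))
        d += timedelta(days=1)
    return "(" + " OR ".join(days) + ")"

print(partition_predicate(datetime(2014, 12, 1), datetime(2014, 12, 3)))
# ((year=2014 AND month=12 AND day=1) OR (year=2014 AND month=12 AND day=2))
```

The generated fragment would be spliced into the query's WHERE clause alongside the existing webrequest_source partition filter.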
[17:55:59] ottomata: in the apps unique query [17:56:00] https://gist.github.com/Ironholds/428014d22edb7969ff5c [17:56:39] ottomata: could we do the distinct calculation in oozie using something that is not hive? [17:57:52] ottomata: or is hive the best place to do it (dataset by then will be about 10 millions) [17:58:49] hm, why count distinct? oh this is to know how many apps are installed out there? [18:03:38] ottomata: yes [18:06:15] not sure how oozie is relevant [18:06:24] the distinct in hive is probably fine [18:06:31] you could do filtering in a udf if you wanted to [18:07:37] qchris, how do I reply to comments on an old version of a commit? [18:07:45] (responding to more of your comments, have patched for some of them) [18:08:49] Ironholds: Go to [18:08:51] https://gerrit.wikimedia.org/r/#/c/180023/ [18:09:09] To the left of "Patch Set 5", there is a grey rectangle [18:09:14] oh, wait! [18:09:14] Click it. [18:09:16] yes, found it. doy. [18:09:19] Thankee :) [18:09:24] I don't know why I didn't see that before. [18:09:25] yw. [18:11:07] (CR) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [18:11:18] qchris_away, thx, forwarded [18:11:18] t [18:11:35] yurikR: You're goot at stealing keyboard focus :-) [18:11:40] s/goot/good/ [18:11:52] i'm good at many things, focus is not one of them [18:12:01] Hahahaha :-P [18:48:01] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Add Last-Modified to performance logging - https://phabricator.wikimedia.org/T78767#853808 (Tgr) >>! In T78767#852888, @Gilles wrote: > Ah, it turns out that the "timestamp" column IS the Date header. So we only need Last-Modified. I think timesta... 
[18:49:21] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Add Last-Modified to performance logging - https://phabricator.wikimedia.org/T78767#853811 (Tgr) Oh yeah we have a manual timestamp field, I remember now. [18:53:26] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Add Last-Modified to performance logging - https://phabricator.wikimedia.org/T78767#853815 (Tgr) `Last-Modified` is a [[ http://www.w3.org/TR/cors/#simple-response-header | simple response header ]] so there should not be any CORS issues with this. [18:58:54] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Make upload.wikimedia.org serve images with Timing-Allow-Origin header - https://phabricator.wikimedia.org/T76020#853836 (Aklapper) If contributors wanted to work on this (as this task is marked as "easy"), would they find their way with the inform... [19:02:27] so um, Ironholds, qchris_away. i called that thing Webrequest on purpose [19:02:37] i could be convinced otherwise [19:03:15] as it can be more than just a class that works with pageview def. [19:04:06] originally, that change also had methods (and corresponding udfs) to get the site qualifier, or the project language [19:04:07] etc. [19:04:28] also, if we do ETL the webrequest logs to a different format, it might be a good place to build a webrequest object model [19:05:02] similar to the way I am doing for revisions here: https://gerrit.wikimedia.org/r/#/c/171056/5/refinery-core/src/main/java/org/wikimedia/mediawiki/Revision.java [19:07:33] ottomata, so, huh. 
[19:07:40] I think, two things [19:07:51] first, it's going to end up a REALLY big class if it's for all things webrequest [19:08:25] remember that just for the new def it will also have to incorporate access method tagging, and a wrapper around the UA parser that does device identification, and XFF param extraction, and a ton of other things [19:08:39] X_Analytics rather [19:08:50] and some of those are generaliseable methods to X_Analytics and some of them are very specific to pageviews. [19:09:24] e.g., some of those regex objects are only things we care about for pageviews. In fact, most of them. [19:09:31] it also creates an inconsistency between the WebStats UDF and the new def. [19:09:42] so basically I can see one of two things happening, here. [19:09:58] the first is we just suck it up and have a single Webstats class and pack everything into that [19:10:29] or we can have a Webstats class that contains generalised methods, such as X_Analytics parsing or site IDing, and then have a new/old def class that inherits from it. [19:11:03] I'd prefer the second, personally; abstracting things away is the big advantage of OOP! But I'm in the class of "not confident enough that they know what they're doing to do more than -1 something, Apache style". [19:11:11] So whatever the outcome, I'll be fine with it, but here is my case. [19:11:15] ...that was long. [19:14:37] all for inheritance, and hm, maybe what you say is ok [19:14:56] if doing OO things, i like object models, and webrequest could be one. that doesn't mean it couldn't encapsulate other more generic classes [19:15:06] so maybe the x-analytics and even pageview stuff could live elsewhere, dunno [19:15:15] but, it would be nice to have an instantiated webrequset object [19:15:17] from which you could do [19:15:24] if webrequest.isPageview() ... [19:15:28] rather than have to do [19:15:36] Webrequest.isPageview(fieldA, fieldB, fieldC) [19:15:37] etc. 
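Ottomata's two call shapes, an instantiated object versus a static per-field method, can be sketched as follows (in Python rather than the Java under discussion, with a deliberately toy stand-in predicate; the real pageview definition is far more involved):

```python
class Webrequest:
    """Toy object model of one webrequest line (illustrative fields only)."""

    def __init__(self, uri_host, uri_path, http_status):
        self.uri_host = uri_host
        self.uri_path = uri_path
        self.http_status = http_status

    def is_pageview(self):
        # Instantiated style: the object carries its own fields,
        # so callers write webrequest.is_pageview().
        return (self.http_status == 200
                and self.uri_path.startswith("/wiki/"))

    @staticmethod
    def is_pageview_fields(uri_host, uri_path, http_status):
        # Static style: every field is passed explicitly, as a Hive UDF
        # receives them -- no object-construction overhead per row.
        return http_status == 200 and uri_path.startswith("/wiki/")

r = Webrequest("en.wikipedia.org", "/wiki/Main_Page", 200)
print(r.is_pageview())  # True
```

The trade-off discussed above is exactly this: the object form is nicer for languages and tools that model whole requests, while the static form avoids constructing an object per row inside a UDF.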
[19:15:56] yup [19:15:58] Ironholds: ^ :) [19:16:03] agreed! [19:16:05] moar elegant hive! [19:16:12] well, that is not hive [19:16:24] with hive, I think it is unlikely that one would use an instantiated webrequest class [19:16:36] since I think the UDFs always work with specific fields [19:16:36] fair! [19:16:41] and it would just be overhead to do [19:16:46] new Webrequest(fieldA, fieldB...) [19:16:50] yup [19:16:59] hmn. [19:17:18] but certainly other languages and techs that want to work with webrequests (or other data sources) [19:17:21] what would happen if we had the UDF just accept (*)? I imagine variadic functions are a tremendous pain in Java (they are in C++) [19:17:40] not having to type everything out would be, you know: nice, though. [19:18:26] Ironholds: e.g. [19:18:26] https://github.com/declerambaul/WikiScalding/tree/master/src/main/scala/org/wikimedia/scalding [19:19:03] I am getting lost in what gets discussed here. Is it still about the name of that class? [19:19:30] Ironholds: I guess variable parameters is not hard? [19:19:30] http://docs.oracle.com/javase/1.5.0/docs/guide/language/varargs.html [19:19:34] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Make upload.wikimedia.org serve images with Timing-Allow-Origin header - https://phabricator.wikimedia.org/T76020#853850 (Tgr) I'm still hoping to find an empty weekend to turn all easy MM bugs into GCI tasks. That would require a pointer to the ri... [19:19:35] qchris: yes, and the motivations for it :) [19:19:46] Java continues to suck less than C++! [19:19:48] well done Java. [19:20:41] In its current form, the class only has one thing (uriHostPattern) that is not strictly "Pageviews only". [19:20:51] But uriHostPattern is pageview relevant. [19:20:58] So everything in there is pageviews. [19:21:07] Only one thing might be reusable. [19:21:20] qchris, I am planning for the future! [19:21:35] overengineering from the start! woohoo!
[19:21:43] Yes, for the future, a separate Webrequest class that uses the Pageview class is a nice thing. [19:22:01] :) that would be fine with me :) [19:22:12] haha, a Pageview could extend Webrequest :p [19:22:36] NE way, ok ok [19:23:47] Meh. Not ok. You gotta argue. You cannot give in after me writing some 5 lines. [19:24:06] :-P [19:25:49] qchris, that is kinda what Ironholds said above, and I think it makes sense too. I'm just arguing that modeling a webrequest might be a good idea [19:26:11] i'm fine if the webrequest model uses encapsulation to do its thang [19:26:17] qchris just approved an idea of mine [19:26:20] brb throwing a party [19:26:24] hahah [19:26:26] a CODE idea. Large party :D [19:26:30] although, i guess I'd start from my way, rather than your way [19:26:48] start with webrequest, and if it needs to be decomposed into smaller pieces later, then cool [19:27:09] If you prefer to carve it out later ... that's fine by me too. [19:27:24] we are so good at letting the other have his way. [19:27:28] (sometimes) [19:27:28] :) [19:27:41] :-P [19:34:17] mmmn. [19:34:23] I'd rather start off how we mean to continue. [19:34:40] So, for the time being, let's focus on reviewing the pageviews UDF (which now has a specific class) and getting it in [19:34:53] and then I'll throw in a commit that just splits the reusable elements out into a static class [19:34:55] and we can go from there. [19:35:20] oh awesome, i get it qchris, re @RunWith(Parameterized.class) [19:35:21] awesome! [19:35:22] but I'd like to avoid actively building on the current structure, if we know we'll want to change it. Changing it is trivial; changing it after N months of building gets progressively more finicky.
[19:35:23] (CR) QChris: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:35:36] ottomata: \o/ [19:35:53] qchris, aha! I get what you mean (re above patch) [19:35:55] Ironholds: changing the underlying structure is trivial [19:35:57] that's...weirdly beautiful [19:35:58] the UDF can remain the same [19:39:40] ok, qchris, re testing [19:39:53] i like the parametrized input, expected format [19:40:12] That's great! [19:40:29] can we make a little framework for reading the test parameters from a file? [19:40:46] oh wait... [19:40:48] maybe this exists already [19:40:56] hold on, googling... [19:41:34] Hahaha. You really want that external file. I see. [19:41:45] Well, then let's do that external file. [19:42:00] Halfak is also a big fan, for when we want non-Java implementations [19:42:11] the argument makes sense to me. Let's go for it. [19:42:16] qchris: https://github.com/Pragmatists/JUnitParams/blob/master/src/test/java/junitparams/FileParamsTest.java [19:42:16] ? [19:42:56] * qchris looks [19:43:06] also [19:43:06] http://stackoverflow.com/questions/21401504/junit-parameterized-tests-using-file-input [19:45:17] ah this is a better link I think: http://www.codeyouneed.com/parameterized-junit-test/ [19:46:09] hm, as long as we can stick a json webrequest object into a csv file, then we could do: [19:46:21] Both are the same, aren't they? [19:46:26] {json webrequest object that is a pageview}, true [19:46:31] {json webrequest object that is not a pageview}, false [19:46:40] eh? [19:47:18] Both pages you linked use @RunWith(JUnitParamsRunner.class) [19:47:24] from pl.pragmatists [19:47:28] JUnitParams [19:47:33] yes, same, just a better doc [19:47:38] Ah. Ok. [19:49:21] ottomata: I think the library would be worthwhile to try. [19:50:18] About the "json in csv" ...
why put in json and not directly the needed parameters. [19:50:20] ? [19:50:44] (CR) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:52:00] qchris, i think it would be better to start with the data we will actually be working with, especially if/when oliver adds new constraints that require new fields [19:52:19] wikimedia/mediawiki-extensions-EventLogging#296 (wmf/1.25wmf13 - 3d94580 : Reedy): The build passed. [19:52:19] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/3d9458092336 [19:52:19] Build details : http://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/44370217 [19:52:27] actually, i would almost rather encode more properties about each test object than just true (meaning isPageview) [19:52:28] we could do [19:52:44] {webrequest object}, {is_pageview: true, is_app_request: false} [19:52:44] etc. [19:52:52] maybe? hmm, i think you will not like that [19:52:57] tell me it is not good... [19:53:09] Analytics-Wikimetrics, Analytics-Engineering: Re-run Wikimetrics data once Labs issues are fixed [8 pts] - https://phabricator.wikimedia.org/T78305#853887 (kevinator) [19:53:22] It is great! [19:53:32] you do like it!? :) [19:53:37] But I only do not like JSON in the mix there [19:53:40] ah [19:53:54] it looks like this lib only works with a 2 column csv [19:54:14] Really? [19:54:28] That would make the whole thing pointless. [19:55:00] that's what all the examples do anyway, [19:55:03] column A is input [19:55:05] column B is expected [19:55:13] 1. @FileParameters("src/test/resources/NameUtils.test.csv") [19:55:13] 2. public void testNameCapitalization(String input, String expected) { [19:55:19] 1. john doe, John Doe [19:55:19] 2. DR. JOHN DOE, Dr.
John Doe [19:55:50] The docs about JUnit's parametrized tests are that stupid too. But JUnit's parametrized tests can do arbitrary numbers of parameters. [19:55:56] that is also how nuria's UAparser tests are kinda working [19:56:02] The docs just don't tell you explicitly how. [19:56:06] An array of two objects [19:56:16] oh [19:56:20] HMM [19:56:23] ok, will try some stuff [19:56:39] hm, but wouldn't an object with named fields be better for this anyway? [19:56:45] so you don't have to infer meaning from column order? [20:01:19] https://github.com/Pragmatists/JUnitParams/blob/master/src/test/java/junitparams/ParamsInAnnotationTest.java [20:01:25] ottomata: line 47 in ^. [20:01:39] So the library should be able to handle at least three fields. [20:02:18] If it can do two and three fields, I hope it can treat arbitrarily many. [20:02:26] (But I don't know). [20:02:59] About the named fields ... yes, that's a nice aspect of json. [20:03:04] hm, yeah, ok [20:03:24] But mixing two technologies unnecessarily ... I am not much of a fan of that. [20:03:38] Having only CSVs is simpler than JSON in CSVs. [20:03:38] csv + json you mean? [20:03:49] Yes. [20:04:07] DarTar: real nice e-mail with the freebase stuff! [20:05:35] Analytics-Engineering, Analytics-Dashiki, Analytics-Wikimetrics: Remove the "confusing" under reported data for Edits and Pages Created in Vital Signs - https://phabricator.wikimedia.org/T75617#853943 (kevinator) Scope change: remove all data, not just the data before Nov 1st. This work is now a subtask of T... [20:05:48] gwah, archiva isn't letting me log in!
[20:05:52] Analytics-Engineering, Analytics-Dashiki, Analytics-Wikimetrics: Remove the "confusing" under reported data for Edits and Pages Created in Vital Signs - https://phabricator.wikimedia.org/T75617#853949 (kevinator) [20:06:08] Analytics-Wikimetrics: Re-run Wikimetrics data once Labs issues are fixed [8 pts] - https://phabricator.wikimedia.org/T78305#841895 (kevinator) [20:07:44] Analytics-Wikimetrics, Analytics-Dashiki, Analytics-Engineering: Remove the "confusing" under reported data for Edits and Pages Created in Vital Signs - https://phabricator.wikimedia.org/T75617#853970 (Milimetric) Open>Resolved Done as of a few hours ago. Vital signs will have no data until tonight whe... [20:09:39] Analytics-Dashiki, Analytics-Wikimetrics: Remove the "confusing" under reported data for Edits and Pages Created in Vital Signs - https://phabricator.wikimedia.org/T75617#853987 (kevinator) [20:11:08] thanks nuria__ [20:13:38] oh, java [20:13:45] y u gotta obfuscate what test failed [20:15:35] qchris, confirmed, arbitrary # of columns is cool [20:15:42] ottomata: \o/ [20:17:06] there we go [20:24:53] (CR) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [20:25:03] ah qchris! [20:25:10] i think it can read csv headers! [20:25:18] and map those to parameter names in tests [20:25:19] checking now! [20:25:23] all. the. better! [20:25:46] would still prefer json, and we could implement our own DataMapper [20:25:49] like [20:25:50] https://github.com/Pragmatists/JUnitParams/blob/master/src/main/java/junitparams/mappers/CsvWithHeaderMapper.java [20:25:51] but ja [20:25:55] dunno [20:25:59] or [20:26:00] hm [20:26:01] yeah [20:26:02] hm [20:26:57] I'd use plain $SOME_TECHNOLOGY if possible. Then you can edit the file nicely with /any/ tool that speaks $SOME_TECHNOLOGY. 
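[Editor's aside: the behaviour ottomata confirmed — one CSV row mapping onto an arbitrary number of test-method parameters — can be sketched with plain stdlib code. This is an illustration only, not the JUnitParams internals; the example row mirrors the pageview columns discussed later:]

```java
import java.util.Arrays;

public class CsvRowDemo {
    // Split one CSV line into however many trimmed columns it contains;
    // each column would land in its own test-method parameter slot.
    static String[] columns(String line) {
        String[] parts = line.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        String row = "basic pageview, true, en.wikipedia.org, /wiki/Main_Page, 200";
        // Five columns here, but nothing in the mechanism fixes the count at two.
        System.out.println(Arrays.toString(columns(row)));
    }
}
```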
[20:27:49] csvs are nice, because they are just suited for spreadsheet programs. [20:28:14] And they align nicely (both horizontally and vertically) without any coding. [20:28:20] yes, i think you are right, one or the other. we could use a single input field [20:28:29] if using json [20:28:37] and stick the expected values in as fields too [20:28:39] e.g. [20:28:58] {hostname: ..., ip: ..., is_pageview: true, is_appview: false} [20:29:24] it would be nice to just be able to copy/paste webrequest jsons from kafka. also, if we use avro, they will look like json to most users. [20:29:44] so it might be nice to use something consistent throughout refinery. [20:29:51] If you go with json, I think we should separate input from assertion. hence I'd rather go with something like [20:29:52] {input: {hostname: ..., ip: ...}, is_pageview: true, is_appview: false} [20:29:53] just annoying to have to write csvs for these tests. dunno [20:30:00] ah [20:30:03] yeah that would be fine [20:30:07] HMM [20:31:09] I find json with many fields a bit unreadable. But if you say, it has to be json for the test specs, it's ok. [20:31:11] ok, qchris i think I am not going to try to implement a junitparams json mapper right now. CSV should work, doing some more testing. if it does, i will try to abstract the existing tests out into csv files and share them. [20:31:30] Cool. [20:31:30] if we get annoyed with writing CSV data for tests, then we can revisit this later. [20:32:08] Hey ... I missed the "not" in your above JSON mapper. [20:32:19] You really said you would /not/ do it? [20:32:28] Meh. I would have lost a bet :-D [20:32:39] yeah, i think that would be a rabbit hole that I don't have time to go down right now I think :) [20:32:44] Ok. Then CSVs for now. [20:32:46] i would probably try to get it all upstreamed and everything [20:33:01] Hahaha.
[20:33:50] * qchris keeps fingers crossed that the library's CSV parser does allow proper escaping, so we can use crazy characters in the CSVs. [20:34:47] qchris, wanna hear something fun? [20:35:00] Ironholds: totally! [20:35:02] your comment about splitting out the app logic, since it's reused but very self-contained [20:35:12] https://github.com/Ironholds/WMUtils/blob/master/R/log_sieve.R#L26-L35 [20:35:26] that is precisely what the hacky R implementation does because it was the most rational way of doing things :D [20:35:43] kevinator, when you put a task in a sprint do you leave it in your backlog project? It's a hassle to maintain state on two workboards. [20:36:13] kevinator: also, can I fix the typo in "Incomming" column of https://phabricator.wikimedia.org/project/board/840/ ? :) [20:36:20] Ironholds: :-) [20:37:29] spagewmf: yes, once the team has committed to completing a task in a sprint, I’ll eventually remove it from the “analytics-engineering” project where I groom the backlog. [20:37:51] spagewmf: you’re welcome to fix my typo :-) [20:37:58] pssshhh looks like named headers are not actually implemented [20:38:02] kevinator: thx. I wish you could drag-and-drop between workboard windows [20:38:06] the WithHeader class just skips the first line [20:38:16] the non header class implies that the WithHeader class would be smarter [20:38:16] :-D [20:38:20] it says [20:38:26] that would be nice [20:38:31] be sure [20:38:31] * the columns in the file match exactly the ordering of arguments in the test [20:38:31] * method. [20:38:49] that would imply to me that in the WithHeaders one, you wouldn't have to be sure they match! [20:39:02] oh well :( [20:39:20] :-( [20:53:51] qchris, qq [20:54:02] could I use the contains() strategy with the "sections=0" check as well? [20:54:12] That's not looking for any wildcards, it just cares that sections=0 is in the string somewhere. 
[20:55:16] Ironholds: Yes, if you're interested in a plain substring search, "contains" would do the trick. [20:55:27] grand! Thanks :) [20:55:31] But ... you're sure that you want a plain substring search? [20:55:39] I mean ... [20:56:04] You could also parse the parameters and look for a "section" key having value 0 in the set of parameters. [20:56:17] Argh. Shut up qchris. [20:56:32] Not your call. [20:56:42] Ironholds: Sorry. [20:56:50] Ironholds: Yes, contains allows you to do that. [20:56:51] hahah [20:56:54] awesome :) [20:57:36] what's the advantage of that approach? Just to play devil's advocate. [20:57:56] * qchris sprays holy water at Ironholds. [21:10:01] (PS7) Ottomata: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 [21:10:02] this is pretty awesooooome! ^ [21:10:06] Ironholds: [21:10:07] qchris: [21:10:20] https://gerrit.wikimedia.org/r/#/c/180023/7/refinery-core/src/test/resources/pageview_test_data.csv [21:10:23] together with [21:10:33] neat! [21:10:36] https://gerrit.wikimedia.org/r/#/c/180023/7/refinery-core/src/test/java/org/wikimedia/mediawiki/TestPageview.java [21:10:40] * Ironholds will remember to git pull before amending [21:10:42] (BTW, I renamed it to Pageview [21:10:46] i hope no one is real mad) [21:11:10] +1 to singular names where they make sense :) [21:12:17] OOp [21:12:22] booboo, amending...
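[Editor's aside: the advantage of qchris's parse-the-parameters alternative, which the devil's-advocate question leaves hanging, is precision: a plain contains("sections=0") also matches a query string containing e.g. "usersections=0". A sketch with a hypothetical helper, not the refinery code:]

```java
import java.util.HashMap;
import java.util.Map;

public class QueryParamDemo {
    // Parse a query string like "action=mobileview&sections=0"
    // into a key/value map.
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new HashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String query = "action=mobileview&usersections=0";
        // Substring search matches even though no sections parameter exists:
        System.out.println(query.contains("sections=0"));
        // Parsing the parameters does not:
        System.out.println("0".equals(parseQuery(query).get("sections")));
    }
}
```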
[21:13:54] (PS8) Ottomata: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 [21:13:58] there we go [21:14:00] Ironholds: yea so [21:14:11] now, to add new test cases [21:14:20] you only need to edit the pageview_test_data.csv file [21:14:26] it is a csv [21:14:30] sweeet [21:14:32] halfak, ^ [21:14:34] columns are: [21:14:35] test_description, is_pageview, uri_host, uri_path, uri_query, http_status, content_type, user_agent [21:14:36] wait, can I make one request? [21:14:39] sure! [21:14:41] can we use TSVs? [21:14:45] CSVs are stupid and make baby jesus cry. [21:14:49] i dunno... [21:14:55] i'm using an external lib [21:14:55] also, sometimes people put commas in user agents. [21:15:04] because people are assholes [21:15:05] welp, don't put them in your test cases! [21:15:22] you're saying I artificially limit my test cases to not mimic the real-world scenarios the code will be exposed to? :D [21:15:49] Ironholds, let's just convert it to TSV ourselves :P [21:16:01] ANDREW HIERONYMUS OTTO! Go to your room and think about what you just did! [21:16:05] halfak, yeah, I guess we could [21:16:14] csv.replace("\t", "\\t").replace(",", "\t") [21:17:52] haha [21:17:52] welp [21:17:58] the answer is yes [21:17:59] but [21:18:10] you'd have to implement a mapper and create one that knows how to use tsvs [21:18:10] https://github.com/Pragmatists/JUnitParams/tree/master/src/main/java/junitparams/mappers [21:18:33] meh. let's just write a script that converts CSV to TSV. [21:18:50] Oh wait... You're saying something needs to read it. [21:18:59] yes [21:19:21] this junitparams lib automatically reads the csv columns into test method parameters [21:20:38] ottomata: That csv (or maybe tsv at some point) is great!
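[Editor's aside: halfak's one-liner above works as a quick conversion for simple rows, but note that it treats every comma as a separator, so a comma inside a field — exactly the user-agent case Ironholds raises — gets split too. A sketch of that one-liner:]

```java
public class CsvToTsvDemo {
    // Naive CSV -> TSV: escape any literal tabs, then turn every comma
    // into a tab. Commas that belong *inside* a field (e.g. in a
    // user agent string) are NOT protected.
    static String csvToTsv(String line) {
        return line.replace("\t", "\\t").replace(",", "\t");
    }

    public static void main(String[] args) {
        System.out.println(csvToTsv("en.wikipedia.org,/wiki/Main_Page,200"));
    }
}
```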
[21:20:58] Ironholds: halfak, I would prefer to do JSON all the way [21:21:03] but that would also require some more coding [21:21:07] this lib works easily with csvs [21:21:15] and i think that is good enough for now. we can really get into it if we need to later [21:24:09] oh! [21:25:58] Ironholds: it will let us use | char [21:26:00] instead of , [21:26:01] if you like [21:26:22] which do you prefer? [21:26:47] CSV; things have readers for those [21:26:53] aye :) [21:26:56] I've never seen a pipe-separated file and I'd rather we didn't start the trend ;p [21:27:15] this bit of the code is kinda dumb [21:27:36] this is how it is reading the line from the csv [21:27:37] https://github.com/Pragmatists/JUnitParams/blob/master/src/main/java/junitparams/internal/InvokeParameterisedMethod.java#L138 [21:27:54] if (character == ',' || character == '|') { [21:28:00] looks like you can escape commas [21:32:31] halfak, https://en.wikipedia.org/wiki/User_talk:Ironholds#December_2014:_Added_an_additional_educational_reference_for_learning_R_data_mining_techniques. boom [21:36:43] dammit, I can't directly git pull [21:36:44] wtf [21:36:46] rebase? [21:42:11] yeah [21:42:13] Ironholds: [21:42:27] if you have already locally committed [21:42:31] you should do [21:42:39] um [21:43:12] I haven't yet! [21:43:14] oh ok [21:43:16] then easier [21:43:17] git stash [21:43:22] and then rebase? [21:43:36] um, yes, but i'm not sure what to rebase to, since you are rebasing to an unmerged change [21:43:36] um [21:44:59] mmm, potato soup w/ bacon, hot sauce, and garlic :) [21:45:05] wrong room... [21:45:14] um, Ironholds [21:45:15] maybe [21:45:22] git review -m 180023 [21:45:23] ? [21:45:27] use at your own risk!
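[Editor's aside: per the InvokeParameterisedMethod line quoted above, the library splits on both ',' and '|' and lets a backslash escape a literal comma. A minimal reimplementation of that splitting behaviour — a sketch inferred from the quoted snippet, not the library source verbatim:]

```java
import java.util.ArrayList;
import java.util.List;

public class ParamSplitDemo {
    // Split on ',' or '|' unless the separator is preceded by a backslash,
    // in which case the separator is kept literally in the field.
    static List<String> splitParams(String line) {
        List<String> params = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean isSeparator = (c == ',' || c == '|');
            if (isSeparator && current.length() > 0
                    && current.charAt(current.length() - 1) == '\\') {
                // Escaped: replace the backslash with the literal separator.
                current.setCharAt(current.length() - 1, c);
            } else if (isSeparator) {
                params.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        params.add(current.toString().trim());
        return params;
    }

    public static void main(String[] args) {
        // The escaped comma stays inside the user-agent field:
        System.out.println(splitParams("Mozilla/5.0 (Foo\\, Bar), true"));
    }
}
```

[This is why the escape support matters for the user-agent column: commas in agents survive as data rather than splitting the row.]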
[21:46:20] hmm, nope [21:46:53] ok, Ironholds, here is what I do, but I'm sure there is a better way [21:46:55] git stash [21:46:56] git checkout master [21:47:08] git branch -D review/ottomata/pageview [21:47:12] git review -d 180023 [21:47:14] git stash pop [21:47:55] aha [21:48:28] merge conflict! arghhhh! [21:48:38] oh sod it, I'll copy-paste [22:19:28] Ironholds: FYI while talking to ottomata and qchris this morning, I logged a task to create a UDF to parse the X-Analytics header: https://phabricator.wikimedia.org/T78805 [22:19:35] yay! [22:19:38] I am stealing the hell out of that [22:19:48] only it'll probably end up as a class rather than a single function [22:20:08] ok… I was saving it for the contractor, but you can call dibs because it looks like you need it now [22:20:11] because I figure we want both a generalised parser, and parsers for very-common tasks [22:20:17] e.g. UUID extraction [22:20:34] Analytics-Refinery: Hive User calls UDF to extract fields out of X-Analytics header - https://phabricator.wikimedia.org/T78805#854305 (Ironholds) a:Ironholds Yoink [22:20:43] will work on it as soon as I've got this UDF with otto and christian finished [22:20:54] you lot realise I have spent all week writing Java, right? [22:21:00] this fills me with fear. fear and impostor syndrome. [22:21:38] I thought I heard you call yourself a Java programmer in the meeting just now ;-) [22:22:15] naw, I said I'd been writing Java [22:22:22] I don't call myself a programmer, or a developer, except sarcastically [22:34:17] laters all! [22:34:40] laters ottomata!
[22:38:51] the only good bit of ottomata going out is fewer commit conflicts [22:38:53] :P [22:56:53] (CR) QChris: [C: 2 V: 2] Catch ValueError raised by hive.partition_datetime_from_path [analytics/refinery] - https://gerrit.wikimedia.org/r/180305 (owner: Ottomata) [23:16:04] Quarry: Switch Quarry to use Material Design Bootstrap theme - https://phabricator.wikimedia.org/T76140#854487 (yuvipanda) Open>declined a:yuvipanda But has a somewhat fucked up license, so... no. https://github.com/FezVrasta/bootstrap-material-design/blob/master/LICENSE.md [23:32:42] Analytics, MediaWiki-User-blocking: Generate stats on blocked IP (ranges) that attempt to edit - https://phabricator.wikimedia.org/T78840#854518 (Krenair) I imagine this is WMF Analytics' area? [23:45:54] (PS9) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [23:47:34] (CR) OliverKeyes: [WIP] UDF for classifying pageviews according to https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [23:52:48] Analytics, MediaWiki-User-blocking: Generate stats on blocked IP (ranges) that attempt to edit - https://phabricator.wikimedia.org/T78840#854542 (Rjd0060) >>! In T78840#854518, @Krenair wrote: > I imagine this is WMF Analytics' area? I don't believe. To clarify it would be helpful for users (or some subset... [23:54:56] seeing gerrit-wm surface my name, is still tremendously weird [23:57:26] Analytics, MediaWiki-User-blocking: Make a SpecialPage to show stats on blocked IP (ranges) that attempt to edit - https://phabricator.wikimedia.org/T78840#854549 (Krenair)