[14:20:33] (PS2) QChris: Drop Oozie bundle for Icinga monitoring of webrequest datasets [analytics/refinery] - https://gerrit.wikimedia.org/r/177217 [14:20:35] (PS1) QChris: Re-align HiveQL for creating the webstats table [analytics/refinery] - https://gerrit.wikimedia.org/r/182806 [14:20:37] (PS1) QChris: Rearrange directory layout underneath /oozie [analytics/refinery] - https://gerrit.wikimedia.org/r/182807 [14:20:39] (PS1) QChris: Archive pagecounts-all-sites to a directory having that name [analytics/refinery] - https://gerrit.wikimedia.org/r/182808 [14:20:41] (PS1) QChris: Rename webstats table to pagecounts_all_sites [analytics/refinery] - https://gerrit.wikimedia.org/r/182809 [14:54:20] (PS2) Ottomata: Abort deployment, if Oozie's Hive config seems to contain passwords [analytics/refinery] - https://gerrit.wikimedia.org/r/177714 (owner: QChris) [14:54:33] (CR) Ottomata: [C: 2 V: 2] Abort deployment, if Oozie's Hive config seems to contain passwords [analytics/refinery] - https://gerrit.wikimedia.org/r/177714 (owner: QChris) [14:54:50] (PS2) Ottomata: Improve error message in deploy script, if Oozie's Hive config is not readable [analytics/refinery] - https://gerrit.wikimedia.org/r/177715 (owner: QChris) [14:55:09] (CR) Ottomata: [V: 2] Improve error message in deploy script, if Oozie's Hive config is not readable [analytics/refinery] - https://gerrit.wikimedia.org/r/177715 (owner: QChris) [14:55:51] (PS2) Ottomata: Drop jars that are not on all worker nodes from Oozie's Hive config [analytics/refinery] - https://gerrit.wikimedia.org/r/177716 (owner: QChris) [14:57:30] ottomata: can you join the hangout? [14:57:41] yus [14:57:45] I cannot connect to Google Hangout :-( [14:57:50] Redirection loop :-(( [14:58:24] oh, i just joined [14:58:25] no one there [14:58:28] https://plus.google.com/hangouts/_/wikimedia.org/a-batcave [14:58:40] Yes. That one. 
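The "abort deployment if Oozie's Hive config seems to contain passwords" change merged above amounts to scanning the config for credential-looking property names before shipping it. A rough Python sketch of that idea (the regex and the property names are illustrative guesses, not the actual refinery deploy-script logic):

```python
import re

# Illustrative only: a deploy-time guard that refuses to ship a Hive
# config whose property names look credential-like. The regex and the
# sample properties are guesses, not the actual deploy-script check.
PASSWORD_PROPERTY = re.compile(
    r"<name>[^<]*(?:password|credential)[^<]*</name>", re.IGNORECASE
)

def config_seems_to_contain_passwords(config_xml: str) -> bool:
    """Return True if any <property> name looks like it holds a secret."""
    return PASSWORD_PROPERTY.search(config_xml) is not None

safe = "<property><name>hive.exec.scratchdir</name><value>/tmp</value></property>"
risky = ("<property><name>javax.jdo.option.ConnectionPassword</name>"
         "<value>secret</value></property>")

assert not config_seems_to_contain_passwords(safe)
assert config_seems_to_contain_passwords(risky)
```

Matching on property names rather than values keeps the check cheap and avoids false negatives when a secret happens to look innocuous.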
[14:58:45] * qchris shakes fist at Google [14:58:55] hm [14:59:13] (CR) Ottomata: [C: 2 V: 2] Drop jars that are not on all worker nodes from Oozie's Hive config [analytics/refinery] - https://gerrit.wikimedia.org/r/177716 (owner: QChris) [14:59:29] (PS2) Ottomata: Allow to override deployment checks for Oozie's hive config [analytics/refinery] - https://gerrit.wikimedia.org/r/182564 (owner: QChris) [14:59:35] (CR) Ottomata: [C: 2 V: 2] Allow to override deployment checks for Oozie's hive config [analytics/refinery] - https://gerrit.wikimedia.org/r/182564 (owner: QChris) [15:00:13] Mhmm. The old batcave is working for me. The new one isn't :-(( [15:01:45] The new batcave is redirecting me to [15:01:45] https://accounts.google.com/ServiceLogin?service=buzz&passive=1209600&continue=https%3A%2F%2Fplus.google.com%3A443%2Fhangouts%2F_%2Fwikimedia.org%2Fa-batcave%3Fauthuser%3D1%26pli%3D1&authuser=1 [15:01:45] And there the browser complains about redirection errors. [15:01:54] huh... [15:02:15] and you're logged in separately from all that? [15:02:22] Like, you're authenticated against google [15:02:36] Yup. [15:02:55] Trying third browser :-) [15:03:40] third browser's a charm! [15:03:42] :) [15:11:56] Ops-Access-Requests, Analytics-Cluster: Access to Hadoop Cluster for Ananth Ramakrishnan (new contractor) - https://phabricator.wikimedia.org/T85229#955098 (Ottomata) Open>Resolved [15:19:30] qchris: I will try to CR your pagecounts code today, forgot to say. [15:19:55] nuria: k. thanks. [15:53:13] ok, qchris_away [15:53:17] oh, lemme know when you are back [15:54:00] (CR) Gilles: [C: 1] "The table doesn't exist until the tracking lands on mediawiki.org, can't +2 this just yet" [analytics/multimedia] - https://gerrit.wikimedia.org/r/182404 (owner: Gergő Tisza) [15:56:00] (PS3) Ottomata: Drop Oozie bundle for Icinga monitoring of webrequest datasets [analytics/refinery] - https://gerrit.wikimedia.org/r/177217 (owner: QChris) [16:11:44] ottomata: back. 
[16:12:17] Was it about the deployment, or things like "pagecounts-all-sites" vs. "pagecounts_all_sites"? [16:13:06] things in general, i think they're all cool. just wanted to ask a stupid question that maybe i've asked before [16:13:17] i'm all for removing unused code, re. the icinga check thing [16:13:20] send_nsca [16:13:28] but, what I don't like, is that no one will ever find it! [16:13:35] unless you or I remember that it existed once [16:13:37] Last time we made a branch. [16:13:44] With a descriptive name. [16:14:07] Should I create a branch "removing_webrequest_monitoring"? [16:14:14] on refinery? [16:14:15] we did? [16:14:18] or something else? [16:14:27] In kraken. [16:14:48] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/kraken,branches [16:14:58] Like the branch etl-storm [16:15:09] Like the branch funnel [16:16:41] ah [16:16:51] We could do the same thing here. [16:16:56] i dunno, i just wanted to say that it makes me sad to just delete it. i'm ok with just removing it [16:17:08] if you don't mind making a branch, that's cool, but i'm not TOO worried about it :) [16:17:27] But with the branch, the commit that removes it, sticks out a bit, [16:17:30] and is better noticed. [16:17:34] ok cool [16:17:35] then let's do it [16:17:40] I am also sad to remove it. [16:17:46] But meh.
Icinga version issues :-( [16:17:51] aye [16:20:44] Analytics-Cluster: Add done-flag for webrequest partitions indicating that the partition is available (but not yet checked) - https://phabricator.wikimedia.org/T85811#955302 (Ottomata) NEW a:Ottomata [16:25:55] (CR) Nuria: "Code looks good, I think only a unit test is missing, we should test the "user can tag its own cohort" and "user cannot tag somebody else'" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/182391 (owner: Bmansurov) [16:26:29] qchris: if you can, feel free to force push that code to a branch, no need for review of branch creation [16:26:32] of code that already exists [16:26:38] if you do that, i'll just merge this one change [16:28:38] ottomata: Done [16:28:38] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/refinery,branches [16:28:41] Branch archive/webrequest_icinga_monitoring [16:29:45] cool, danke [16:30:12] (CR) Ottomata: [C: 2 V: 2] "A pointer to this code was added in the branch archive/webrequest_icinga_monitoring" [analytics/refinery] - https://gerrit.wikimedia.org/r/177217 (owner: QChris) [16:30:26] (PS2) Ottomata: Re-align HiveQL for creating the webstats table [analytics/refinery] - https://gerrit.wikimedia.org/r/182806 (owner: QChris) [16:31:12] (CR) Ottomata: [C: 2 V: 2] Re-align HiveQL for creating the webstats table [analytics/refinery] - https://gerrit.wikimedia.org/r/182806 (owner: QChris) [16:31:26] haha, qchris, man i have a bad memory [16:31:33] i do not really remember discussing this change, but I like it :) [16:31:40] the webrequest/load, etc. [16:31:40] move [16:31:43] (PS2) Ottomata: Rearrange directory layout underneath /oozie [analytics/refinery] - https://gerrit.wikimedia.org/r/182807 (owner: QChris) [16:31:45] :-D [16:31:57] (CR) Ottomata: [C: 2 V: 2] Rearrange directory layout underneath /oozie [analytics/refinery] - https://gerrit.wikimedia.org/r/182807 (owner: QChris) [16:32:05] The IRC logs have the discussion. Let me find it again.
[16:32:13] (PS2) Ottomata: Archive pagecounts-all-sites to a directory having that name [analytics/refinery] - https://gerrit.wikimedia.org/r/182808 (owner: QChris) [16:32:19] (CR) Ottomata: [C: 2 V: 2] Archive pagecounts-all-sites to a directory having that name [analytics/refinery] - https://gerrit.wikimedia.org/r/182808 (owner: QChris) [16:32:24] qchris: are you planning on resubmitting these jobs? [16:32:25] IIRC there was some discussion on an etherpad. Even has-har chimed in. [16:32:31] with the name changes, e.g. pagecounts-all-sites? [16:32:38] there is no logical change, so I guess it doesn't really matter [16:32:46] Yes. That was my plan for today. [16:32:55] oh, maybe table name change [16:32:56] aye cool [16:33:17] (PS2) Ottomata: Rename webstats table to pagecounts_all_sites [analytics/refinery] - https://gerrit.wikimedia.org/r/182809 (owner: QChris) [16:33:24] Also ... some jobs are currently deployed from "/wmf/current". So they'd break upon the next deploy. [16:33:25] (CR) Ottomata: [C: 2 V: 2] Rename webstats table to pagecounts_all_sites [analytics/refinery] - https://gerrit.wikimedia.org/r/182809 (owner: QChris) [16:33:45] So I'd rather do it now, while we know what to do, and how to do it. [16:36:16] aye [16:36:18] cool [16:36:20] go for it [16:36:40] k. [16:36:49] * qchris goes to break oozie :-) [17:00:52] qchris: fyi, i'm deploying that varnishkafka change, expect seq reset [17:01:03] Nice! [17:03:10] (CR) Nuria: [C: 1] [WIP] First draft of refinement phase for webrequest (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 (owner: Ottomata) [17:06:45] (CR) Ottomata: [WIP] First draft of refinement phase for webrequest (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 (owner: Ottomata) [17:10:59] qchris: re anchored regexes in pageview change [17:11:06] you suggested that I make them unanchored! [17:11:14] they previously did ^...$ with the regex [17:11:15] yes.
[17:11:18] ^.*....*$ [17:11:33] Some should be unanchored, some should be anchored. [17:11:41] ah, i guess Ironholds will ahve to address that then [17:11:51] I think so too. [17:12:02] He should know where ho would want anchoring. [17:12:02] ja, ok, Ironholds, if you are there, I think qchris' recent comments are all for ouy [17:12:07] let's get this thing merged! [17:32:05] qchris: are you able to join the debrief hangout? [17:32:16] Argh. [17:32:17] 1 sec. [17:55:13] (PS1) QChris: Fix place for workflow to add partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/182845 [17:56:02] (CR) Ottomata: [C: 2 V: 2] Fix place for workflow to add partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/182845 (owner: QChris) [17:56:22] Thanks! [18:00:05] (PS2) Ottomata: [WIP] First draft of refinement phase for webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 [18:19:37] Analytics-Wikimetrics, Analytics-Engineering: Epic: Grantmaking User gets reports on Wikimetrics usage - https://phabricator.wikimedia.org/T76106#955630 (Capt_Swing) Looks like the only remaining blocker on this task is mailing list notification. When will it be complete? [18:27:40] ottomata, jfyi, plan to handle qchris's latest comments on the pageview UDF this morning [18:27:44] (to avoid us both working on it :D) [18:32:06] cool, danke [18:32:10] (CR) Bmansurov: "Thanks for the review. I'll add tests." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/182391 (owner: Bmansurov) [18:39:05] Analytics-Cluster: Force Hue https redirects. - https://phabricator.wikimedia.org/T85834#955712 (Ottomata) NEW a:Ottomata [18:39:46] chasemp: btw, do you happen to know if that is set for all wmf https requests? [18:39:48] or jsut misc? [18:40:35] dunno only tested with misc [18:41:02] k [18:41:03] danke [18:50:13] Ironholds: yt? 
[18:51:01] ottomata, yup [18:51:50] I just learned that we set X-Forwarded-Proto: https for all of our https requests [18:51:54] we could log this header [18:51:57] would that be useful? [19:01:10] ottomata, yes, but there should already be some patches in for varnish that accidentally set that logging everywhere [19:01:16] as a consequence of Apps' changes [19:02:30] ? [19:02:41] so, milimetric, yt? [19:04:02] yea [19:04:07] one sec [19:04:23] oop, just realized i'm late for ops meeting [19:04:25] one hour for me... [19:05:19] qchris: how do you run tests on the aggregator repo via nosetests directly? or tox ? [19:05:33] nuria: I run them via tox [19:08:41] milimetric: i can kinda chat during meeting [19:08:51] wanted to say something about the xmldump format experiment results [19:10:42] qchris: does plain "tox" work for you? [19:11:03] nuria: Yes. [19:11:08] Which test is failing for you? [19:11:38] qchris: nah, no tests, i normally run tox inside vagrant but not on my mac let me set it up [19:12:33] nuria: You can also run the tests in vagrant, if you have tox there. [19:13:27] qchris: right, but this should work easily on mac too w/o issues as tox is supposed to work well, let me see one sec, [19:13:31] ah.. i know [19:16:09] ottomata: ok, listening [19:16:19] so, ja, re the xmldump results [19:16:42] i think the reason straight json performed better than others (avro, etc.), is because of hadoop streaming, and the avro input format [19:17:01] the avro input format that hadoop streaming uses [19:17:04] converts avro to json [19:17:17] so, aaron's code expects json no matter what [19:17:22] and parses it either way [19:17:34] in the case of avro, there is an extra conversion step (avro -> json) [19:17:37] right [19:17:53] as for the reducers never finishing...dunno much [19:17:57] ls [19:18:05] hi drdee!
[19:18:06] :) [19:18:09] haha [19:19:08] right - that's the weird thing, the format differences and the extra conversion doesn't seem to me to account for the huge performance problem we seem to have [19:19:25] like - longer on 50 reducers than a single box? [19:19:26] weird... [19:19:32] yeah, that seems unrelated to me [19:20:07] i don't know much about how aaron's job works, but i also doubt it is a cluster problem. more likely it is a job optimization problem. [19:20:20] not sure. [19:20:22] could be wrong [19:23:12] (PS1) Yurik: Run HQL queries as part of the script [analytics/zero-sms] - https://gerrit.wikimedia.org/r/182863 [19:23:35] (CR) Yurik: [C: 2 V: 2] Run HQL queries as part of the script [analytics/zero-sms] - https://gerrit.wikimedia.org/r/182863 (owner: Yurik) [19:23:41] it seems like the same logic in both cases, on cluster and stat1003 [19:25:39] nuria: I see that 5 monitoring tests are failing. It seems the logic to create fixtures does not work on each weekday :-/ [19:25:51] qchris: just saw that, yes [19:25:59] ah ok. [19:26:32] right, milimetric, same logic, i'm sure the logic is fine. i'm more thinking about how it interacts with hadoop streaming, etc. e.g. why 50 reducers? [19:27:13] qchris: it happens from 1st changeset: https://gerrit.wikimedia.org/r/#/c/182680/1 but i take is not related to changeset itself right? [19:27:15] right, i should take a closer look [19:28:13] nuria: I assume it is related to the current date. [19:28:17] ottomata, sorry, was unclear earlier [19:28:32] there's a https=1 flag for the mobile varnishes; as a consequence of some changes Apps are making, that will be on the text varnishes too [19:28:45] so it's indirectly solved [19:30:13] nuria: But the monitoring is less of an issue for now. You can comment out the relevant test for your review, and I'll see to getting the monitoring fixed afterwards. 
[19:31:06] qchris: ok, let me see how tests are working as that would be good for me to know [19:31:07] (CR) Mforns: "Hi Rtnpro," (9 comments) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/180828 (owner: Rtnpro) [19:39:36] Ironholds: https=1 is part of X-Analytics? [19:40:14] yep [19:43:13] (PS22) OliverKeyes: Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:43:39] (CR) OliverKeyes: Add UDF for classifying pageviews (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:43:56] patch made [19:44:32] Cool, Ironholds, the only thing I see not responded to is the question about anchored regexes [19:45:08] oops [19:45:11] didn't see; will change [19:45:46] (PS23) OliverKeyes: Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:45:50] nuria, one question about bug T85233 [19:46:00] mforns: aham [19:46:01] (CR) OliverKeyes: Add UDF for classifying pageviews (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:46:03] done [19:46:18] what are the json files that you refer to? [19:46:22] Ironholds: I don't think you can just switch it [19:46:22] config files? [19:46:27] darnit [19:46:36] qchris: is asking if the regexes should or should not be anchored [19:46:46] find by default makes them unanchored [19:47:01] mforns: any, metric files on the default view for example [19:47:09] matches will make them match against the entire string [19:47:09] ah, I misread, I thought that was a suggestion in the same way qchris's other suggestions are suggestions :D [19:47:25] then it should be find [19:47:28] we'd have to add ^ and $ at the end to do whatever [19:47:42] ...how does one revert to the previous state?
[19:47:52] ha, easiest just to edit and recommit i think [19:47:55] mforns: for example if default view is newly registered, what happens if newly registered for enwiki is not available [19:47:58] yeah, thought so :/ [19:48:00] Ironholds: It's fine to just say "shut up" to my comments :-) [19:48:06] aha [19:48:09] qchris, but your comments are usually right! :D [19:48:18] but these aren't CSV files? [19:48:18] I read it in the same way I read "is the weirdness that Z is capitalised?" :P [19:48:22] nuria, ^ [19:48:43] Ironholds: :-P [19:48:47] mforns: the "metrics" could be csv or json, wikimetrics metrics are json [19:48:58] oh I see, ok, now I understand [19:49:07] yay, test failure [19:49:10] mforns: ok [19:49:41] nuria, then you were not talking about defaultDashboard config or CategorizedMetrics config, right? [19:50:09] (PS24) OliverKeyes: Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [19:50:14] done and done [19:50:38] mforns: that is different, as dashiki cannot work without those files, but it should work -just -fine- if it cannot retrieve a metric [19:50:41] file [19:51:13] nuria, ok, thanks!
[19:51:39] so, Ironholds, I think qchris wants you to think about each regex, and decide if they all should be unanchored [19:51:54] by using unanchored find(), the regex can match anywhere in the string [19:52:36] ahhh [19:52:41] * Ironholds re-reviews [19:53:17] if we really want to use a little helper method with either find() or matches() it'll need to be find [19:53:23] but we should totally add ^ where applicable [19:53:35] well, we could have 2 helper methods, one for each [19:53:44] if we want/need to [19:55:46] that seems pretty brittle, though [19:55:48] it's an extra thing to fark with if the regexes need updating [19:56:08] "update regex, make sure we're using the right wrapper method to /call/ that regex and check that" [19:56:29] I'd rather just have sensible regexes (i.e., those with ^ or $ where appropriate) [19:56:33] currently amending the Webrequests class but will tweak the regexes and test when done
Some regexes do not lend themselves to anchoring [20:05:41] in particular the special pages one [20:06:44] I will experiment and see if the tests blow up, I guess [20:07:09] ottomata: merge at will :-D [20:08:55] okay, so I can totally add ^ to a couple of regexes, but there are some fundamentally un-anchorable ones [20:09:19] uriQueryPattern, for example [20:09:19] Ironholds: let's just keep using find() for all. I think it will work. Anchor the ones that make sense to. [20:09:34] have done, tests still pass, will merge and await experimentation to prove I suck at writing tests :D [20:09:51] (PS25) OliverKeyes: Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [20:10:00] leila: did you get the doc about random sampling written somewhere? [20:13:13] await experimentation, Ironholds? [20:15:00] nuria, I'm documenting it here. Will finalize it by eod: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques [20:15:12] and will send an email to the list nuria [20:15:38] leila: ottomata has a change WIP that will allow us to use the random sampling with hive partitioning [20:17:05] ottomata, yeah, I need to do some of that today :/. So much to do. [20:17:28] aye, ok. [20:17:59] nuria: this change is kinda wip, i'm experimenting with a couple of settings on it to see what does better, but it will be mostly a guess on some of them for now. i might merge it soon and start converting it regularly, we will see. 
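To the question raised above — what happens if a regex is anchored but run through find() — find()-style matching does not ignore anchors: Java's Matcher.find() succeeds on a match anywhere in the input but still honors an explicit ^ or $, while Matcher.matches() implicitly anchors both ends. The same distinction sketched in Python for brevity (re.search behaves like find(), re.fullmatch like matches(); the patterns here are made up, not the actual UDF regexes):

```python
import re

# Hypothetical patterns, just to show the semantics:
pattern = re.compile(r"/wiki/")    # unanchored
anchored = re.compile(r"^/wiki/")  # explicitly anchored on the left only

path = "/wiki/Main_Page"
tricky = "/w/index.php?title=/wiki/"

# search(), like Java's find(), matches anywhere in the string...
assert pattern.search(path) and pattern.search(tricky)
# ...but it does NOT ignore explicit anchors, so one-sided anchoring works:
assert anchored.search(path)
assert not anchored.search(tricky)
# fullmatch(), like Java's matches(), anchors both ends implicitly:
assert not pattern.fullmatch(path)
```

This is why "keep using find() and add ^/$ where they make sense" works: anchors become opt-in per regex instead of being forced on both ends.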
[20:18:08] in addition to clustering the table, it will be stored in parquet [20:18:14] which should make your query much faster too [20:18:21] since you only need like 2 columns [20:18:40] the amount of data it will have to read (and thus the amount of disk I/O needed) will be much much less [20:19:15] ottomata: understood, it's just so leila knows that we will be able to do the sampling effectively eventually [20:19:23] nuria: I'll share the link and ottomata can add the new way, or I add it once I parse it. [20:39:16] (CR) QChris: [C: -1] "Computation of week number is faulty." (3 comments) [analytics/aggregator] - https://gerrit.wikimedia.org/r/182676 (owner: QChris) [20:40:25] (CR) QChris: [C: -1] "Week numbers are likely to be wrong. See comment on line 266 of" [analytics/aggregator/data] - https://gerrit.wikimedia.org/r/182684 (owner: QChris) [20:43:29] nuria: re: "we have been collecting data but due to caching is not until this week we have data for all users." -- what do you mean? which users do we not have data on? [20:44:42] ori: "the whole reader base at large", as with the 30 days cache for the new module we added we have collected data for pages expired /readers (makes sense?) [20:45:15] nuria: makes sense, but do you expect that to bias the data? [20:45:57] ori: for the yes, no, I would like to get more data as to the "no" [20:46:25] ori: but overall the question was "is it safe for us to use sendbeacon when available?" [20:46:25] ?? [20:46:31] ori: answer is yes [20:46:35] right [20:46:44] in fact, i think we should use it by default whenever it is available [20:46:47] rather than have a special api [20:47:02] i'm basing this on your analysis, which is convincing, i think [20:47:40] ori: right, we have little data for browsers that do not support sendbeacon well (as there are few (any?)) and that is where i was hoping overall data might provide some info (it might not) [20:48:16] the numbers are so small that i don't think it's interesting.
it could be stuff like: proxies that don't properly support POST [20:49:29] ori: there will be some of that sure, but i bet we will find some other cases, but, you are right overall, they will be very insignificant [20:49:45] ori: otherwise they would have shown up already [20:49:46] the reason i'm eager is because removing the experiment would allow me to go ahead with , which has a bunch of other improvements too, and which i really don't want to split up [20:49:50] yes, agreed [20:50:56] ori: that is fine, i think the only caveat would be adding an if clause for a browser [20:51:14] yeah, that's there [20:51:14] with sendbeacon available but not working, it proceeds [20:51:26] ori: but our main case is assuming it works well [20:51:40] see line 222 of https://gerrit.wikimedia.org/r/#/c/182557/7/modules/ext.eventLogging.core.js (right side) [20:52:29] oh, i see what you're saying. you think we should make an imagerequest if navigator.sendBeacon returns false? [20:54:08] ori: no it returns true even if you send your stuff to "garbage", that is probably not a good gauge, i think it will be a plain browser check if any [20:54:34] as in if navigator.sendBeacon and !some-version-of-safari-that-we-know-it-no-works [20:54:49] is there really such a browser? [20:55:54] ori: that is what I will know once i have a little bit more data, but I "suspect" answer is going to be "no" [20:56:37] it wouldn't be worth adding an exception for such a rare browser anyhow, imo [20:59:03] ori: it's probably more important to document it yes, from the data we have, we know it's not a common occurrence. [20:59:56] so, can we disable the experiment? [21:00:53] ori: when will these changes get deployed?
[21:04:40] nuria: today, ideally [21:05:04] ori: let me look at the tables for a sec [21:05:55] hey nuria, lemme know when you got a sec, just want your opinion on some clustered sampling stuff [21:06:06] ottomata: k, give me 5 mins [21:08:27] (PS3) Ottomata: [WIP] First draft of refinement phase for webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 [21:11:32] (PS2) OliverKeyes: [WIP] start of a generalised class of UDFs for handling the webrequests table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/181939 [21:14:07] Ironholds: I can review that for you, but first, fix tab/spaces issue :), and alpha sort any regex cases please :) [21:14:29] ottomata, review what? [21:14:33] oh, the generalised class? [21:14:38] sure. You want tabs or spaces? [21:14:40] Webrequest stuff [21:14:42] 4 spaces. [21:14:51] gotcha. Also, realise it actually doesn't work [21:14:53] https://gerrit.wikimedia.org/r/#/c/181939/2/refinery-core/src/main/java/org/wikimedia/analytics/refinery/Webrequests.java [21:14:55] ah ok [21:14:56] I can't get the pom to recognise JUnit [21:14:59] so not ready to review? [21:15:12] also, nobody on the internet knows how to have multiple UDFs for one Java class [21:15:19] did you know Hive's UDF examples are in *C++*? [21:15:20] stupid hive [21:15:23] ? [21:15:24] no cookie for you [21:15:30] their "how to write UDFs" [21:15:51] where? [21:16:55] oh wait, this is *cloudera* being stupid [21:16:58] http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_udf.html [21:17:28] aye ja, that looks impala related [21:17:50] i'm not sure what you mean by 'for one java class' [21:17:54] you mean in the hive udf class? [21:18:10] if so, you can't. each class is a single UDF [21:18:15] ori: we have about twice as much data as before, about 1.5 million records, do you know what % of pages get evicted from the cache? [21:19:07] not off the top of my head [21:19:16] ottomata, huh.
So, one Java class, called by multiple UDF classes? [21:19:27] Makes sense: I'll spend my afternoon tinkering there, then [21:19:30] yes [21:19:45] the one java class is not a requirement at all, we just like doing things that way so we aren't tied to hive at all [21:19:50] yeah [21:19:53] hive UDF classes should be very thin and call out to other code [21:20:04] they are just a hive udf interface wrapper [21:20:05] tis all [21:20:05] we really need to get this UDF - the pageviews one - in [21:20:12] too much infrastructure for other UDFs is tied into it [21:20:14] i'm here to merge! how's it! [21:20:15] ? [21:20:21] the pom calls for unit testing frameworks, the csvs.. [21:20:22] oh! [21:20:27] * Ironholds double-triple-checks [21:20:47] agree, there are a lot of WIP changes that have their own poms in refinery-core [21:20:52] gonna be conflicts if/when we try to merge them [21:20:57] would be good to get this in first so we have a base pom [21:21:14] exactly [21:21:22] it LGTM; I've done as much anchoring as I can [21:21:28] but obviously it's partly my code so I would recommend caution [21:21:33] given that we're up to patch 22 ;p [21:21:44] ori: i think that, to be precise, it will be best to merge changes when we have data for more readers if possible, that will be 1 more week [21:21:44] aye, welp, your code is your code, i can only really review the structure and organization [21:21:48] the logic is mostly up to you :) [21:22:07] Ironholds: patch #22 is not so bad [21:22:07] nuria: okay, that's fine. i can wait. thanks for looking into it. [21:22:09] really [21:22:21] ori: Thank youuu [21:22:29] qchris_away: will you give your +1 on this if it is ok with you? [21:22:30] https://gerrit.wikimedia.org/r/#/c/180023/ [21:22:44] ottomata: talking about random sampling?
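The layout described above — each Hive UDF class a thin wrapper over shared core logic, so the logic isn't tied to Hive — can be sketched like this (Python for brevity; the class names and the toy classification logic are hypothetical, not the actual refinery code):

```python
# Sketch of the "thin wrapper" layout discussed above: core logic lives
# in one plain class, and each engine-facing "UDF" is a minimal adapter.
# All names and the toy logic are hypothetical.

class PageviewCore:
    """Shared classification logic, independent of any Hive interface."""

    @staticmethod
    def is_pageview(uri_path: str) -> bool:
        return uri_path.startswith("/wiki/")

    @staticmethod
    def is_legacy_pageview(uri_path: str) -> bool:
        return uri_path.startswith("/w/index.php")

# One thin wrapper per exposed function; each just delegates, so
# swapping Hive for another engine only means writing new wrappers.
def is_pageview_udf(uri_path):
    return PageviewCore.is_pageview(uri_path)

def is_legacy_pageview_udf(uri_path):
    return PageviewCore.is_legacy_pageview(uri_path)
```

This also resolves the "multiple UDFs for one Java class" confusion: Hive wants one class per UDF, but all those classes can delegate to a single core class.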
[21:23:22] ja so, to do the clustering, i pick some field, or fields, and then pick a number of buckets [21:23:32] hive will then hash the field values to a number, and mod by the buckets and store those in files [21:24:07] for just sampling, the choice of field doesn't matter, as long as there's a wide range of possible values [21:24:21] and the values aren't particularly skewed to a single value [21:24:34] so, picking uri_host is bad, for example, because of the skew towards en.wikimedia [21:24:37] wikipedia* [21:24:42] but. [21:24:57] there are advantages to picking a field on which you might do joining [21:25:28] ottomata: I do not see us doing much joining though (but you might know best) [21:25:45] it's more like heavy text processing per record [21:26:01] aye, i mean, i guess if we got more mediawiki data in, maybe joining would be useful? [21:26:05] e.g. category counts? [21:26:07] view counts* [21:26:09] dunno. [21:26:25] so, i'm thinking of just picking ip for now, as it should be a wide range of values [21:26:40] i guess if we decide we need something cooler later for joins, we can change? [21:27:04] ottomata: i think for wiki data might be too early to tell, for page data I do not think joining is as important [21:28:00] ottomata: right, whatever we pick we should be able to change, as teh best choice will probably become obvious with usage [21:28:03] *the [21:28:31] nuria: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Data_Format_Experiments#Results [21:29:03] ottomata: oooohhhhh [21:29:26] that is without clustering, i'm not sure how you want to test that [21:29:36] sorry [21:29:39] without sampling* [21:29:54] ottomata: nice!
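The clustering scheme described above — hash each field value and mod by the bucket count — can be sketched as follows. This is Python for illustration (Hive uses its own hash function, not md5), and it also shows why a skewed field like uri_host is a poor choice for sampling:

```python
import hashlib
from collections import Counter

def bucket_of(value: str, n_buckets: int) -> int:
    """Hash a field value and mod by the bucket count, the way a
    clustered table assigns rows to bucket files. (Hive has its own
    hash function; md5 is used here purely for illustration.)"""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % n_buckets

# A wide-ranging field like client ip spreads rows across all buckets,
# so reading one of 8 buckets gives roughly a 1-in-8 random sample...
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
ip_spread = Counter(bucket_of(ip, 8) for ip in ips)
assert len(ip_spread) == 8

# ...while a skewed field like uri_host piles most rows into whichever
# bucket en.wikipedia.org hashes to, ruining the sample.
hosts = ["en.wikipedia.org"] * 900 + ["de.wikipedia.org"] * 100
host_spread = Counter(bucket_of(h, 8) for h in hosts)
assert len(host_spread) <= 2
assert max(host_spread.values()) >= 900
```

Reading a single bucket of a well-spread clustered table is what makes the cheap "1/n sample" query possible without scanning everything.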
[21:30:20] ottomata: in this case the apps data is so tiny that we do not benefit from sampling so much (although, by all means we can test it) [21:30:34] ottomata: sampling will be super handy for user agent stats for example [21:33:08] aye cool [21:33:20] nuria: q, did you have to manually set -Duser=$USER when you were testing your oozie jobs? [21:33:33] hive.exec.scratchdir [21:33:34] /tmp/hive-${user} [21:33:39] ottomata: i set it on the properties file yes [21:33:42] didn't have the ${user} var set for me unless I did that [21:33:42] ah ok [21:33:43] which is the same [21:33:48] ok [21:34:31] on my coordinator properties: user = nuria [21:34:39] ^ ottomata [21:34:42] aye [21:34:46] otherwise it no work [21:34:59] ok, when you submit the production ready patch, you should just set that to hdfs [21:35:07] and then just use -Duser=$USER to override if you are testing [21:35:56] ottomata: ok, will do, i have not gotten to that today I was trying to get familiar with testing and code for qchris pageview aggregation but I should be able to get ahead this afternoon [21:36:33] (PS4) Ottomata: First draft of refinement phase for webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 [21:36:54] k, np [21:43:03] (CR) Nuria: [C: 2] "Test test_monitoring fails due to the switch from 2014 to 2015 I believe, seems completely unrelated to this change so voting +2" [analytics/aggregator] - https://gerrit.wikimedia.org/r/182680 (owner: QChris) [21:49:59] hm, nuria. question. [21:50:11] oh, nm nm nm [21:50:15] nevermind :p [21:51:52] nuria: any chance you could +2 the EL commit before 4pm today? i'd like to get it out if possible [21:54:09] (CR) QChris: [C: -1] "Sorry for not spotting earlier that the files are not in the right place."
(3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [21:54:43] ottomata, I am making moving those files your problem :D [21:55:08] ah good catch qchris, thanks [21:55:09] haha, ok [21:57:11] (PS26) Ottomata: Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 [21:57:33] ori: i really prefer someone that crs mw frequently to do it, i only look at EL code once in a blue moon so i would not be able to catch any major mistakes.. let's see if matt is around [22:02:30] qchris: kind of late but since you seem to be around ... [22:02:48] qchris: do tests fail for you on this changeset : https://gerrit.wikimedia.org/r/#/c/181396? [22:03:16] let me check ... [22:03:46] nuria: No. All 68 tests pass. [22:04:14] Does some test fail for you with that change? [22:04:57] qchris: no, my mistake, sorry [22:13:06] (PS5) Ottomata: First draft of refinement phase for webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/182478 [22:17:05] (CR) QChris: [C: 1] "I did not check if the implemented logic itself is sane, but" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [22:17:30] qchris, so, if you're only +1ing...can I +2? :D [22:17:52] Ironholds: Do whatever seems to fit :) [22:18:17] the problem with that and impostor syndrome is I want to -2 it and run screaming out of the house [22:18:24] hey, halfak. Wanna see something cool? [22:18:40] (CR) OliverKeyes: [C: 2 V: 2] Add UDF for classifying pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/180023 (owner: Ottomata) [22:18:43] Ironholds: Then CR-2 :-P [22:18:46] there's something cool [22:18:47] If it's more snow, I've already seen it. ;P [22:18:55] nope. Look at the last gerrit-wm line [22:19:42] Cool. For those who do not speak gerrit, is this merged? [22:19:56] +2d, not yet merged [22:20:03] i just merged.
i will now build an updated refinery-hive.jar and put it in archiva and deploy to stat1002
[22:20:07] yay!
[22:20:12] oh, gotta bump refinery version :)
[22:20:22] ottomata, I'll kill the existing legacy UDF and rebase on top of the merged version
[22:20:28] then I can take advantage of the same tests
[22:20:34] ok, you don't have to abandon it, rebasing should work
[22:20:37] Ironholds, don't forget to go check a checkbox on Trello :)
[22:20:44] ottomata, yeah, but there are gonna be hella-conflicts
[22:20:46] halfak, yessir!
[22:22:15] (PS3) OliverKeyes: [WIP] start of a generalised class of UDFs for handling the webrequests table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/181939
[22:22:16] qchris: do you know of a better way to 'release' other than: edit poms, commit, tag, build, upload to archiva, edit poms back to -SNAPSHOT and commit again?
[22:22:32] oh, rebasing actually worked
[22:22:35] ottomata: "mvn release"
[22:22:46] wait, no it didn't
[22:22:48] But that might need setup of a mvn plugin.
[22:23:06] it's the pom presence; I'll kill that, amend and rebase
[22:23:53] (PS4) OliverKeyes: [WIP] start of a generalised class of UDFs for handling the webrequests table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/181939
[22:24:08] bwer, don't feel like setting up a plugin right now
[22:25:21] (Abandoned) OliverKeyes: [WIP] UDF for identifying if a request meets the legacy pageview definition.
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/181049 (owner: OliverKeyes)
[22:25:56] (PS1) Ottomata: Bump version to 0.0.3 to include Pageview classification logic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/182937
[22:25:58] qchris: ^
[22:26:12] * qchris looks
[22:28:31] (CR) QChris: [C: 2 V: 2] Bump version to 0.0.3 to include Pageview classification logic [analytics/refinery/source] - https://gerrit.wikimedia.org/r/182937 (owner: Ottomata)
[22:38:26] ottomata, I don't suppose you're familiar with a way to make maven actually goddamn tell you how the tests failed?
[22:38:49] "please refer to the ObviouslyJavaRelatedBecauseIt'sSoStupidlyVerbVerbNamedVerbed file!" no. tell me. FRICKING TELL ME. IF I BROKE SOMETHING TELL ME.
[22:40:01] oh, -e. Helpful
[22:45:09] ottomata, so I'm starting off by trying to use the same tests for legacy and non-legacy pageviews defs
[22:45:20] unfortunately it seems to object to two columns of booleans. Got a minute to help debug?
[22:45:44] sure
[22:46:16] yay! so, https://gist.github.com/Ironholds/80245a95b6c9f690c6bd
[22:46:28] the file currently looks just the same as it used to, but with an is_legacy_pageview boolean column added
[22:46:42] if I do not include is_legacy_pageview, it shouts. If I include it but do not use it, it shouts.
[22:46:48] if I include it and use it, it shouts
[22:47:04] are the number of columns in your csv file the same as the number of args to your test method?
[22:47:17] the assertEquals call?
[22:47:28] no, the arguments to the testIsPageview method
[22:47:35] it will map csv columns to arguments
[22:47:42] in order
[22:47:49] yes!
[22:48:43] they are the same?
[22:48:46] and the column types make sense
[22:48:54] yup
[22:48:58] e.g. only boolean values in boolean column positions?
[22:49:10] yup!
[22:49:34] so in your description
[22:49:42] 'the file' refers to a new test file?
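[Editor's note: the "mvn release" qchris suggested earlier refers to the Maven release plugin, which automates the edit-poms/commit/tag/build/re-edit-to--SNAPSHOT cycle ottomata described. A minimal hypothetical sketch of the pom.xml setup; the version number is illustrative:]

```xml
<!-- Sketch: declare the release plugin in the project's pom.xml. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-release-plugin</artifactId>
      <version>2.5.3</version>
    </plugin>
  </plugins>
</build>

<!--
  A release then becomes two commands instead of hand-editing poms:
    mvn release:prepare   # drops -SNAPSHOT, commits, tags, bumps to next -SNAPSHOT
    mvn release:perform   # checks out the tag, builds, deploys (e.g. to Archiva)
-->
```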
[22:50:06] nope, same one, with an extra field added - hang on
[22:50:16] "Number of parameters inside @Parameters annotation doesn't match the number of test method parameters."
[22:50:36] that... what.
[22:50:43] There are 9 parameters in annotation, while there's 8 parameters in the testIsPageview method
[22:50:57] I mean, there are
[22:51:13] because if I include is_legacy_pageview in the test method, it tells me it doesn't know how to assertEquals(string,boolean,boolean,boolean)
[22:51:25] which is fair enough because I don't want it dealing with the second boolean set :/
[22:54:18] (PS1) Mforns: Prevent crashing ui when metric data is missing [analytics/dashiki] - https://gerrit.wikimedia.org/r/182946
[22:54:21] i am not following
[22:54:37] so, what we want is the test csv to have two boolean columns, is_pageview and is_legacy_pageview
[22:54:50] the pageview UDF tests read them in and check the outcome of isPageview is the same as is_pageview
[22:55:05] the legacy pageview UDF tests read them in and check the outcome of isLegacyPageview is the same as is_legacy_pageview
[22:55:18] that way we only need one file of test cases to maintain
[22:55:22] (PS1) Ottomata: Update refinery-hive to 0.0.3 [analytics/refinery] - https://gerrit.wikimedia.org/r/182947
[22:55:35] but JUnit doesn't seem to like reading in columns it doesn't use.
[22:55:59] as in, it is complaining because assertEquals doesn't use the column?
[22:56:01] that doesn't seem right
[22:56:26] "There are 9 parameters in annotation, while there's 8 parameters in the testIsPageview method" seems to be
[22:56:35] (CR) Ottomata: [C: 2 V: 2] Update refinery-hive to 0.0.3 [analytics/refinery] - https://gerrit.wikimedia.org/r/182947 (owner: Ottomata)
[22:57:08] what if you just use is_pageview in a useless way
[22:57:18] boolean tmpvar = is_pageview;
[22:57:27] seems weird to me though
[22:59:06] (PS1) Ottomata: Bump working version to 0.0.4-SNAPSHOT, now that 0.0.3 is released [analytics/refinery/source] - https://gerrit.wikimedia.org/r/182949
[22:59:06] woot!
[22:59:22] Ironholds: refinery-hive.jar with IsPageviewUDF should now be on stat1002
[22:59:25] and loaded automatically
[22:59:30] you still need to do the create function thing
[22:59:33] yay!
[22:59:34] but aside from that, it is there!
[22:59:34] oh, wait
[22:59:53] [0] Is Pageview - Desktop, true, true, en.wikipedia.org, /wiki/Horseshoe_crab, -, 200, text/html, turnip (testIsPageview)(org.wikimedia.analytics.refinery.hive.TestIsPageviewUDF): wrong number of arguments
[23:00:01] IOW, yeah, it's the argument discrepancy
[23:00:02] (CR) Ottomata: [C: 2 V: 2] Bump working version to 0.0.4-SNAPSHOT, now that 0.0.3 is released [analytics/refinery/source] - https://gerrit.wikimedia.org/r/182949 (owner: Ottomata)
[23:00:08] (using a temp value doesn't seem to work. Womp womp.)
[23:00:38] Ironholds:
[23:00:41] there are two tests
[23:00:44] one for TestPageview
[23:00:48] and another for the UDF
[23:00:52] if you change the csv
[23:00:58] ...
[23:00:59] you have to change the arguments for the test in both files
[23:01:01] * Ironholds headdesks repeatedly
[23:01:25] well, that was a tremendous waste of your time ;p
[23:01:32] never employ a skim-reader to write software
[23:01:43] haha, i do that all the time!
[23:02:06] ok, it is quittin' time!
[23:02:18] take care!
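[Editor's note: the resolution above can be sketched in Java. This assumes the tests use the JUnitParams library's @FileParameters annotation to map CSV columns onto method parameters in order; the class name, CSV path, UDF method, and parameter names below are illustrative, not the actual refinery-source code. The point is that every test method reading the shared CSV must declare one parameter per column, in both test files, even for columns it never asserts on.]

```java
import static org.junit.Assert.assertEquals;

import junitparams.FileParameters;
import junitparams.JUnitParamsRunner;
import junitparams.mappers.CsvWithHeaderMapper;
import org.junit.Test;
import org.junit.runner.RunWith;

// Sketch only. If the CSV gains a column (here is_legacy_pageview) but the
// method signature does not, JUnitParams fails with "Number of parameters
// inside @Parameters annotation doesn't match the number of test method
// parameters" -- the 9-vs-8 error from the chat.
@RunWith(JUnitParamsRunner.class)
public class TestIsPageviewUDF {

    @Test
    @FileParameters(
        value = "src/test/resources/pageview_test_data.csv",  // shared with TestPageview
        mapper = CsvWithHeaderMapper.class
    )
    public void testIsPageview(
            String testDescription,
            boolean isPageview,        // expected result asserted by this class
            boolean isLegacyPageview,  // unused here, but must still be declared
            String uriHost,
            String uriPath,
            String uriQuery,
            String httpStatus,
            String contentType,
            String userAgent) {
        // Only the non-legacy column is asserted here; the legacy UDF's test
        // class reads the same CSV and asserts against isLegacyPageview.
        // IsPageviewUDF stands in for the class under test.
        assertEquals(
            testDescription,
            isPageview,
            new IsPageviewUDF().evaluate(uriHost, uriPath, uriQuery,
                                         httpStatus, contentType, userAgent));
    }
}
```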
[23:55:48] (CR) Nuria: [C: 2] If desktop site starts in 'www.', drop 'www.' for mobile site [analytics/aggregator] - https://gerrit.wikimedia.org/r/181396 (owner: QChris)
[23:56:02] (Merged) jenkins-bot: If desktop site starts in 'www.', drop 'www.' for mobile site [analytics/aggregator] - https://gerrit.wikimedia.org/r/181396 (owner: QChris)
[23:59:31] (CR) Nuria: [C: 2] Get rid of focus on daily aggregates [analytics/aggregator] - https://gerrit.wikimedia.org/r/182662 (owner: QChris)
[23:59:38] (Merged) jenkins-bot: Get rid of focus on daily aggregates [analytics/aggregator] - https://gerrit.wikimedia.org/r/182662 (owner: QChris)