[01:03:40] Hey Ironholds
[01:03:48] You around?
[01:03:54] mhm?
[01:04:07] So, I'm really digging the idea of this event thingie.
[01:04:20] You still interested in hacking on a little prototype this weekend?
[01:05:21] I'm not sure if I have the technical chops for a direct hack :(. I'm happy to do it when we're in SF? This weekend we've got the mystery hunt
[01:05:50] Oh. No worries. I was going to suggest that we pair program one.
[01:06:05] But we can do that better in person anyway. :)
[01:23:04] DarTar: hola
[01:23:51] hey nuria
[01:24:32] DarTar: kevinator was saying you wanted to estimate teh users with js disabled and browsers that do not support js
[01:24:36] *the
[01:24:45] Gabriel, rather :)
[01:25:01] and potentially Jared as part of his REFLEX project
[01:25:10] DarTar: I can tell you how to do it with cluster data, for a pretty good estimate
[01:25:39] DarTar: with the data we already have is sufficient to get a pretty good estimation I think
[01:25:49] that would be great, can you post the proposal on the lists?
[01:26:14] Gabriel and I were brainstorming about sending pairs of events
[01:26:29] but if there’s a solution that requires no instrumentation, I’ll buy that
[01:26:32] DarTar: no experiment needed
[01:26:41] DarTar: what list?
[01:26:48] analytics-l ?
[01:27:02] ah ok, i did not see gabriel's request there
[01:27:11] maybe i missed it completely
[01:27:33] I can point him there, right now this is tracked on Trello/Phab but the discussion should happen somewhere where he can participate
[01:27:41] phab would work too
[01:27:57] Ok, lemme see if i can catch it on irc and after i will post to teh list
[01:27:58] PageViews vs. NavTiming/SampleRate?
[01:28:27] nuria: cool.
halfak: yes I was thinking of something along these lines
[01:28:30] halfak: no, it can be done with browser % versus browsers for which we do not see requests in bits for js files
[01:28:54] halfak: with 2 queries if it makes sense and you find those browsers by exclusion
[01:29:33] halfak: as browsers that do not support js or have it disabled (oddities aside) should not be getting stuff from resource loader
[01:30:26] halfak: the pageview definition is too far off from navtiming numbers
[01:30:40] Huh. I suppose that makes sense. I guess I never considered that the JS wouldn't even be downloaded.
[01:30:41] halfak: even accounting for sampling
[01:30:54] nuria, why do you think that is?
[01:31:09] halfak: lemme see that that is the behaviour in newer browsers too
[01:31:19] It makes sense.
[01:31:34] Why download it if you aren't planning to execute it?
[01:31:36] interesting, I had the same expectations (JS being requested regardless but not used) but it makes sense
[01:31:38] Unless jsonp
[01:31:39] DarTar, nuria: phab++
[01:31:44] Wait... still
[01:31:51] yo gwicke
[01:32:01] halfak: i do not know but last time i did the count comparing webstatscollector pageviews in eswiki (mobile) and navtiming they differ a great deal
[01:32:25] nuria: not surprised
[01:32:30] Interesting. We should be able to apply the pageview definition to navtiming.
[01:32:35] halfak: so while it might work, i think it will be better with bits and main requests for content
[01:32:53] Either that or we are categorically excluding a large group of browsers that do support js.
[01:33:02] nuria, totally.
[01:33:09] halfak: we do, of course, like ie6
[01:33:11] I think this might be worth exploration though.
[01:33:22] Trevor brought up the idea of setting some form field in the edit window from JS, and then tracking the ratio of edits with that set vs. without it
[01:33:24] Really? It won't even log for ie6?
[01:33:37] halfak: js is disabled in wikipedia for ie6
[01:33:47] that would be edits only, but otoh it would probably avoid most of the bots
[01:33:48] halfak: and many other browsers
[01:34:10] gwicke: i think we can do it with data we already have, no need to instrument
[01:34:13] gwicke, edits only?
[01:34:24] halfak: views would be better / more general
[01:34:25] (oh. Missed scrollback)
[01:34:44] +1 for views.
[01:34:45] but for the VE folks something more specialized like edits could be interesting too
[01:35:07] I just fear that a large percentage of the no-JS anon views are various bots
[01:35:08] halfak: so many browsers that support say js 1.1 get no js , that is std practice
[01:35:09] VE should be designing for viewers, not current editors.
[01:35:22] But I can see how they'd want to know.
[01:35:41] nuria, cool. I think I get it.
[01:38:09] for the stuff I'm interested in the percentage of clients that truly don't support JS is actually very interesting too, as we could still serve simple JS to them for simple skinning purposes
[01:39:20] gwicke: ok, this is how we can do it
[01:39:24] I think
[01:39:51] First browsers with js disabled
[01:40:20] gwicke: they will make no js requests so given that we have pageviews for main content we can get browser %
[01:41:00] *nod*
[01:41:02] and browser % for bits requests, compare those two lists and see browsers and request main document but do not request javascript files from bits
[01:41:28] my main worry with that is that with small percentages it probably hinges on the accuracy of bots vs. actual human user detection
[01:41:28] gwicke: so iE6 (as we are excluding it on purpose) will appear , but so will opera mini (the old one)
[01:41:31] gwicke: makes sense?
[01:41:55] gwicke: we can detect bots with data in the cluster provided UA
[01:42:24] ok
[01:42:28] gwicke: all these requests for bits and main document are alredy on hadoop
[01:42:50] * DarTar gotta run, good night folks
[01:42:54] so that would work to get you a pretty good list of browsers that are not using our js infrastructure
[01:42:56] do you have info on logged-in vs. anon too?
[01:43:31] based on the browser stats I'd expect logged-in views to have more JS support
[01:44:28] maybe I should provide some background: we are wondering whether it's feasible to serve the same static cached HTML to logged-in users & anons, and then do the per-user customization in JS
[01:44:31] gwicke: in the cluster..i do not think so as we just log the pageview, let me see cause
[01:44:37] thsi changes a bunch
[01:44:47] more generally, the question is whether we could consider requiring JS for the logged-in experience
[01:44:52] gwicke: ya, decorating
[01:45:22] gwicke: let me do one fast lookup
[01:48:45] gwicke: so, the anons vs logged-in is not in the table fields that the cluster has that i can see
[01:49:01] gwicke: thsu with teh method described you will get "overall" stats
[01:49:09] *thus with the method
[01:49:40] gwicke: they will include data from both anonymous users and logged in users mixed together
[01:50:29] gwicke: still that will give you a good estimate as to volume, for example. If the answer is 20% of our users have no "js-support"
[01:51:37] gwicke: if you assume 1% of logged in requests over the overall volume (making this up) it gives you a 0.2% of logged in users w.o js support
[01:52:33] gwicke: it is an estimation but we have enough data to come up with significant results, even for browsers not widely represented.
[01:54:24] gwicke: If you think this estimation will be sufficient for your purposes (seems like it might thus far) you can do a data request to the research team as no additional instrumentation will be needed.
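The comparison nuria describes above (browsers that request the main document but do not request JS files from bits) could be sketched roughly as follows. This is only an illustration, not the actual cluster query: the per-browser counts, the function names, and the 1% cut-off are invented for the example, and it assumes bot traffic has already been filtered out by user agent. The last step reproduces the back-of-envelope scaling by the logged-in fraction, which assumes logged-in and anonymous users have the same no-JS rate.

```python
from collections import Counter


def browser_share(counts):
    """Convert raw request counts per browser into fractions of the total."""
    total = sum(counts.values())
    return {browser: n / total for browser, n in counts.items()}


def js_request_ratio(pageviews, bits_js_requests):
    """Per browser: bits JS requests divided by main-content pageviews.

    A ratio near zero suggests the browser fetches the main document
    but does not fetch JavaScript from bits.
    """
    return {
        browser: bits_js_requests.get(browser, 0) / views
        for browser, views in pageviews.items()
        if views > 0
    }


def browsers_without_js(pageviews, bits_js_requests, max_ratio=0.01):
    """Find browsers by exclusion: pageviews present, JS requests (almost) absent.

    max_ratio is a hypothetical cut-off chosen for this example.
    """
    ratios = js_request_ratio(pageviews, bits_js_requests)
    return sorted(b for b, r in ratios.items() if r <= max_ratio)


# Hypothetical counts, invented for illustration only.
pageviews = Counter({"Chrome": 6000, "Firefox": 2500, "IE6": 900, "Opera Mini": 600})
bits_js = Counter({"Chrome": 5900, "Firefox": 2450, "IE6": 0, "Opera Mini": 3})

no_js = browsers_without_js(pageviews, bits_js)
shares = browser_share(pageviews)
no_js_share = sum(shares[b] for b in no_js)

# Scale the overall no-JS share by the (assumed) fraction of requests
# that come from logged-in users, as in the estimate above.
logged_in_fraction = 0.01
logged_in_no_js = no_js_share * logged_in_fraction
```

With real data the two count tables would come from two queries over the request logs on Hadoop, one for main-content pageviews and one for JS requests to bits, grouped by browser.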
[02:15:29] nuria, was distracted IRL
[02:15:40] reading backlog..
[02:42:55] nuria: based on the browser stats of logged-in vs. anon views, an overall estimate will probably overestimate the no-js requests
[02:43:38] it'll be great to have as a starting point
[02:44:32] right now we really have no clue if it's closer to 0.01, 0.1 or 1%
[03:50:27] eee
[03:50:31] I got asked to give a talk
[04:04:40] :D
[04:40:54] gwicke: got it, so then thsi method (without additional instrumentation) will work for an initial estimate, good then
[05:11:04] yes, having an order of magnitude would be great already
[05:11:19] nuria: ^^
[05:20:10] halfak, it's at EARL! :d:d:d
[08:47:05] Ironholds: halfak I want to also pick your brains about designing a pageviews counter for toollabs, both as a definition (probably reuse yours) and as a technical implementation
[08:47:16] I think providing popularity measures like these are very important
[13:43:06] YuviPanda, +1 sounds like a good idea.
[13:43:13] We could probably use eventlogging for the stuff in labs.
[13:43:36] I don't think we have anything that is high traffic enough to warrant higher throughput solution.
[13:43:47] Ironholds, EARL?
[14:43:39] hey Ironholds, I'm prepping for the Hadoop namenode migration. planning onturning off hadoop in an hour
[14:43:42] you've got a couple of jobs running
[15:37:03] Helder, thanks for update the meeting time
[15:37:46] jonas_agx, do you want to join us in 25 minutes?
[15:39:00] if you PM me your google email address, I'll add you to the hangout.
[15:44:17] halfak, thanks for invite me but I can't join the hangout -- there is too much noise here
[15:44:41] OK. When we push cards around, we'll hop on IRC.
[15:45:02] oaky
[15:45:08] okay
[16:08:46] Ironholds: i have killed one of your running jobs, sorry :/
[16:08:55] this one: application_1415917009743_80635
[16:41:01] ottomata, that's okay
[16:41:37] halfak, EARL == Effective Applications of the R Language
[16:41:46] it's the second-biggest R conference, after useR
[16:41:53] (also, morning)
[16:45:09] Cool Ironholds!
[16:45:16] yup!
[16:45:26] Tis gonna be on our framework/setup/how we do things
[16:46:23] great, migration done! hadoop is back.
[16:46:35] yay!
[16:53:41] jonas_agx, did you end up reading https://trello.com/c/AwRoe7pG/13-read-multilingual-vandalism-detection-using-language-independent-ex-post-facto-evidence
[16:53:43] ^?
[17:02:54] yes, halfak i did
[17:03:15] I'll mark on trello
[17:12:20] halfak: how can I helpya?! :)_
[17:13:23] ottomata, Hey dude. I sent that email about our data transfer options.
[17:14:10] yes
[17:14:12] uhh
[17:14:14] oh yeah, i should research that
[17:14:24] opening up our cluster for distcp doesn't sound good when I first think abou tit
[17:17:12] halfak: reading those links now
[17:17:42] Great. Whichever option you think is best, I'm down to pursue.
[17:19:32] hmmm, actually, ssh tunnel + distcp might work
[17:19:41] still reading
[17:22:41] ok halfak, i think that this ssh tunnel + hdfs commands will work
[17:22:43] you ready to try?
[17:22:47] http://documentation.altiscale.com/using-httpfs
[17:22:47] ?
[17:23:10] in a meeting. Can try in 10 mins.
[17:23:14] ok
[17:38:59] ottomata, looking at docs
[17:40:17] ottomata, when you say "hdfs commands" should I be looking at webhdfs?
[17:40:59] this one
[17:41:00] yes
[17:41:01] http://documentation.altiscale.com/using-httpfs
[17:41:05] we need
[17:41:06] ssh tunnel
[17:41:11] then maybe Step 2 will work
[17:41:14] Process Data with Hadoop File Commands (HDFS DFS)
[17:41:31] so, i don't remember, did you get ssh tunnel from stat1002 out of bastion to work?
[17:41:45] I did.
[17:42:17] This looks straightforward.
Sorry to bug you with it. A little more reading and I think I could have given it a shot.
[17:42:48] np, the ssh tunnel bit could be a little funky, not sure, lemme know if you have trouble
[17:44:20] Will do
[18:01:10] hi all, I'll skip today's standup, gotta focus on other things...
[18:02:00] no worries. Godspeed Nettrom
[18:28:11] Ironholds: talk to me about accessMetho
[18:28:11] d
[18:28:15] why is it called that again?
[18:29:06] because that's what we call it in all of the prior art, so it's the term people are most familiar with, and all of the other names I came up with were even more confusing :D
[18:29:18] who are most people? researchers?
[18:29:31] What is contained in accessMethod?
[18:29:31] was it called that intentionally? or just somehting people started to say?
[18:29:44] wait
[18:29:46] don't answer Ironholds
[18:29:47] halfak
[18:30:04] ottomata, "yes"
[18:30:20] halfak, "mobile web, mobile app or desktop, which is this request?"
[18:30:33] what would you call a method that returned that?
[18:30:36] given a request
[18:30:44] was it desktop? mobile web? mobile app?
[18:30:56] or, what would you call a field that contained that info
[18:31:08] haha, i guess you are already primed to know what Ironholds calls it :p
[18:31:09] access method?
[18:31:14] oh, you're asking halfak!
[18:31:17] sshh i am trying not to prompt!
[18:31:34] Good q. thinking.
[18:31:57] consider other field names that you know already exist in this context too (e.g. http_method, etc.)
[18:32:05] "Medium"
[18:32:20] Hmm... Nothing brilliant is coming to mind, but I see why accessMethod would be strange.
[18:32:26] and also maybe ' Access-Control-Allow-Origin'
[18:32:28] that is a header too
[18:32:35] not in webrequest logs now, but it is a header
[18:32:36] I would expect accessMethod to be POST/GET or www vs. API.
[18:33:32] Ironholds: if halfak doesn't call this accessMethod, who else does?
[18:33:34] dario?
[18:33:38] ottomata, do you have a suggestion?
[18:33:45] hey ottomata
[18:33:54] tsup
[18:33:57] ottomata, I'm a bad example. I don't work with view logs much.
[18:34:01] oh, ha, didn't mean to ping you
[18:34:02] aye ok
[18:34:07] well, maybe dario can help
[18:34:07] k
[18:34:20] https://www.youtube.com/watch?v=8To-6VIJZRE
[18:34:30] ^ Relevant to my nonsense in the hangout
[18:34:33] dario, if you had to name a field that contained information on what type of client was used for a request, e.g. mobile-web, mobile-app, desktop, what would you call it?
[18:34:35] ha ha
[18:34:47] keep in mind there is already a field called http_method
[18:34:52] that contains things like GET, POST, etc.
[18:35:20] ottomata: hm, that’s what we called “access method” I think
[18:35:24] Ironholds: ^
[18:35:36] wait, no
[18:35:39] type of client
[18:35:43] not destination
[18:36:08] ?
[18:36:20] DarTar, i'm pestering Ironholds because I really do'nt like this name
[18:36:21] :p
[18:36:30] i'm trying to get a sense of how baked in it is already
[18:36:44] then what would you propose?
[18:36:50] ha ha. I probably need a bit more context, can we talk later? I have to clean up the mess on my desk and move to the new one that just arrived
[18:36:59] in the next 20 mins
[18:37:01] bbl
[18:37:04] ha sure, we'll be talking here
[18:37:29] well, Ironholds, this seems to me to be ultimately about the type of client that is being used
[18:37:33] How about "access medium"
[18:37:40] Oh yeah. "client type"
[18:37:45] I'd buy that.
[18:38:25] a browser is a client, a CLI util is a client, an installed app is a client
[18:38:45] except "client type" confuses with device type.
[18:38:58] yeah, maybe, but device is different, no?
[18:39:04] device is the hardware
[18:39:31] it is possible to be on an iphone and be classified as desktop, no?
[18:39:38] if you forcefully browse non mobile web?
[18:40:19] client could be the browser that they are using.
[18:40:22] or vice versa, navigating to en.m.wikipedia.org on your desktop
[18:40:36] true, that's why i suggested clientType or clientClass or something
[18:40:43] clientClassification
[18:41:09] In the app case, the app *is* the "browser"
[18:41:15] perhaps even client is not accurate though, because I am using my client desktop browser, but will be classified as mobile-web if i go to en.m
[18:41:39] perhaps that is why they are using 'access'?
[18:41:40] :p
[18:41:40] yup
[18:41:49] because, as said: every other option is worse ;p
[18:41:56] this is to definitions what democracy is to politics: least-bad.
[18:42:06] accessType?
[18:42:13] (the method part is what bothers me the most)
[18:42:29] i woudl also think that accessMethod must have osmethign to do with HTTP method
[18:43:53] requestType, requestClass
[18:44:15] accessClassification
[18:44:20] accessClass
[18:44:39] hah, i guess it would be good to have dario back, if the name accessMethod is so baked in that researchers already use it for tons of things, it might be too late
[18:47:30] haha, Ironholds, i love that you are scolding ellery for his tabs :p
[18:47:31] haha
[18:47:46] ottomata, I learned my Java from the best ;p
[18:47:50] Ironholds: do you think ellery's code would nicely live in the Webrequest class we are working on?
[18:47:58] the HostParser stuff?
[18:48:24] I think so; it makes sense to move it there, although honestly I think the internals need a rework
[18:49:50] aye cool, we'll work through the review
[18:50:04] and get him to put it in Webrequest after we get it merged
[18:50:05] it is close!
[18:50:31] i think it is ready except for this name unsatisfaction! but i can be convinced!
[18:53:09] yup
[18:53:22] my feeling is we should tokenize the host and do full-string matching rather than regexes, but we'll see
[18:55:29] aye that would make sense mabye
[18:55:47] Ironholds: is there a a researcher convention for what to call what ellery is trying to get?
[18:55:51] i always just call it project
[18:56:41] ottomata, not really; we've been talking through the terminology. I mean, language, project, right? except "language" isn't always language; look at wikidata or the wikimedia projects.
[18:57:02] So I've been battering around the idea of calling them "variant" and "project". en.wikipedia.org is the english-language variant of the wikipedia project.
[18:57:19] commons.wikimedia.org is the media commons variant of the wikimedia project
[18:57:41] it's not much better, but it's at least not immediately contradicted by reality, which language/project is as soon as you go outside *.wikipedia.org
[18:57:54] is wikipeida a project? or en.wikipedia a project
[18:58:04] i.e. is en.wikipedia a differnet project thatn es.wikipedia/
[18:58:04] ?
[18:58:07] i assumed it was
[18:58:35] hm, guess not
[18:58:35] http://wikimediafoundation.org/wiki/Our_projects
[18:58:37] ok
[18:58:38] so
[18:58:44] variant makes sense then
[18:58:55] or something, looks like maybe ellery is calling it qualifier
[18:59:50] Ironholds: what about thing slike
[18:59:53] en.m.wikipedia?
[18:59:58] variant is?
[19:00:09] english-language
[19:00:23] what woudl you call the .m. part?
[19:00:26] for project extraction we don't actually care about (m|zero|wap|mobile) I don't think. ewulczyn?
[19:00:33] I'd just exclude it. Similar to www in that regard.
[19:00:44] hm, ok, maybe that is the 'qualifier'? hehe
[19:00:55] haha, or maybe we could sync up with the name we pick for access method here :p
[19:00:55] ?
[19:02:25] yeah, I'm just throwing out the mobile qualifiers.
[19:02:41] yep
[19:02:44] I adopted the variant terminology that Ironholds suggests
[19:03:14] ewulczyn, would it be useful for me to write a quick-and-dirty version that uses the tokenization approach I was thinking on?
[19:03:26] that way you can check it out, see if the logic makes sense, and we can swap it in if so.
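A rough idea of what the tokenization approach to variant/project extraction might look like, written in Python for brevity rather than as the Java UDF under review. This is a sketch under assumptions, not Ironholds' or ellery's code: the function name `parse_host`, the restriction to `.org` hosts, and the choice to return None for anything unexpected are all made up here. The ignored labels come from the (m|zero|wap|mobile) list and the www remark above.

```python
# Subdomain labels that mark an entry point (mobile, zero, www) rather
# than a variant, so they are dropped before matching.
IGNORED_LABELS = {"www", "m", "zero", "wap", "mobile"}


def parse_host(host):
    """Split a host such as 'en.m.wikipedia.org' into (variant, project).

    Returns None for hosts that do not look like '<variant>.<project>.org'.
    The variant is None when the host has no variant label at all.
    """
    # Tokenize on dots; tolerate upper case and a trailing dot.
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2 or labels[-1] != "org":
        return None
    project = labels[-2]
    # Full-string comparison against a fixed set, no regex involved.
    qualifiers = [label for label in labels[:-2] if label not in IGNORED_LABELS]
    if len(qualifiers) > 1:
        return None
    variant = qualifiers[0] if qualifiers else None
    return variant, project
```

For example, `parse_host("en.m.wikipedia.org")` and `parse_host("en.wikipedia.org")` both give `("en", "wikipedia")`, and `parse_host("commons.wikimedia.org")` gives `("commons", "wikimedia")`.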
[19:03:32] I could provide an option "ISMOBILE", but that would be better handeld by a access method udf due to apps.
[19:03:40] regexes make me nervous
[19:03:50] DarTar: shall we review? halfak and I are in the meeting
[19:03:51] I find a regex much easier to interpret than a bunch of conditions.
[19:04:06] huh. Fair!
[19:04:08] leila: coming, just done reshuffling 3 desks
[19:04:10] success!
[19:04:17] ooooo. congratulations!
[19:04:25] anyway, meeting!
[19:04:29] will be in shortly
[19:04:50] well, except for, i think a regex is likely to be less efficient, but whatever! :p
[19:05:01] k
[19:07:07] ottomata, that was my thinking too
[19:07:11] and they're very consistent strings
[19:07:51] that is something we can refactor later if we like or hafta, if you and/or ellery don't want to do it now
[19:08:09] I'd like us to at least benchmark it; this is something where I can see it being part of the ETL pipeline, so speed is important
[19:08:17] I'll write up a quick-and-dirty example now just to show how simple the logic is
[19:10:56] OO, ewulczyn, it looks like something is wrong with one of your oozie jobs, since I rebooted the cluster
[19:11:32] ottomata: do you have the id?
[19:11:47] i think this coordinator 0024710-141210154539499-oozie-oozi-C
[19:11:52] here is an application that falied
[19:11:58] ahttp://localhost:8088/cluster/app/application_1421426017437_0005
[19:12:00] ack
[19:12:13] application_1421426017437_0005
[19:12:14] ewulczyn, pentaho lives, btw
[19:12:57] hmm, i see a missing hour for hour=1, but i don't think that should cause a failuer, just a job delay
[19:12:57] hm
[19:13:12] checking...
[19:14:05] hm
[19:14:09] not sure those are related actually
[19:14:26] i see some timedout jobs for oozie, possibly missing data in the wmf.webfrequest refined table. checking on that.
[19:14:34] but, those don't look like they are the failed apps
[19:14:54] ellery, which coordinator runs this query?
[19:14:54] INSERT INTO TABLE ellery.impressi...a.spider
[19:14:57] ewulczyn: ^
[19:15:22] mc_coordinator_v0_1 or record_impression_coordinator_v0_1?
[19:17:13] ottomata: all the coordinators I care about are running (as seen in hue)
[19:17:43] I have a bunch of killed coordinators with the same name
[19:18:20] hm, the data is there, hm. but it look slike it didin't come soon enough?
[19:18:26] i'm going to try rerunning one of your jobs ellery
[19:18:28] @170
[19:19:11] ah no, the mobile hour is misssing
[19:19:11] hm
[19:21:02] are your coordinators for wmf.webrequest dependent on correct sequence stats?
[19:21:57] I set a 600 minute timeout on record_impression_coordinator_v0_1
[19:23:50] ewulczyn: yes
[19:25:22] ottomata, thoughts on pom.xml removal? re the webrequest class
[19:27:09] Ironholds: git reset
[19:27:39] ta
[19:29:01] mabye
[19:29:02] ottomata: so my jobs timeout becuase the partitions aren't there, beacuse the sequence stats did not pass on wmf_raw.webrequest?
[19:29:04] that's a guess
[19:29:30] hmm
[19:29:32] actually, Ironholds
[19:29:33] try
[19:29:45] git fetch; git checkout origin/master refinery-tools/pom.xml
[19:30:03] yes. i thikn that is right ewulczyn, except, i do see an actual failed yarn job
[19:30:11] your oozie ones did fail because of timeout, but they never tried to launch a yarn job
[19:30:15] ta
[19:30:21] so there seem to be two issues
[19:31:28] ottomata, thiiink it worked. check it out!
[19:32:19] i'm looking into the missing data one now ewulczyn
[19:32:33] i think we rarely have missing mobile data, so i'm not sure about that one.
[19:37:02] cool i think that's good Ironholds
[19:37:10] now just need to settle name of that method :)
[19:37:23] haha
[19:40:55] okay, I'mma go and work from my friend Wendy's house for the rest of the day
[19:40:58] (humour at "Wendy house")
[19:41:50] halfak: wait, in what hangout are we?
[19:42:12] https://plus.google.com/hangouts/_/wikimedia.org/rd2014?authuser=0
[19:42:15] ewulczyn, ottomata, I'll start an email thread about the UDF
[19:42:16] SOrry. I moved over.
[19:42:21] Thought that would be more intuitive
[19:42:28] just to keep everyone looped in, because there are naming implications
[19:42:45] k
[19:42:47] cool
[19:42:49] i like thank you Ironholds
[19:43:26] *thumbs up*
[19:45:29] ewulczyn: did I accidentally suspend this oozie coordinator, or did you do it on purpose?
[19:45:46] 0009450-141210154539499-oozie-oozi-C
[19:45:50] Job Name : hive_banner_impressions
[19:45:50] App Path : hdfs://analytics-hadoop/user/ellery/record_impression/oozie/coordinator.xml
[19:45:53] Status : SUSPENDED
[19:45:54] Start Time : 2014-12-24 18:00 GMT
[19:45:54] End Time : 2050-01-01 06:00 GMT
[19:46:00] I suspended it
[19:46:03] ok cool
[19:46:13] i think i accidentally had some suspended stuff from when I paused the cluster today
[19:46:15] making sure
[19:54:48] hm, ja, ewulczyn, i think we need to be a little less strict on when we mark hours as ready. liek you suggested. I would almost like to add the _SUCCESS flag as long as there is less than 1% loss or seomthing
[19:55:23] i'm looking at the data missing for one of the mobile hours that recently didn't have data (2015/01/16T01)
[19:55:32] whatever happened, there is very little data missing
[19:55:41] but the few missing lines caused all further processing to not happen :/
[19:57:18] that would be great, especially for daily jobs. I would like my daily jobs to wait on 24 hourly paritions, but the proabbaility of the job halting because a success flag is missing for one of them is too high. So currently I just wait on the last partition I need (i.e.hour=23 partition)
[19:58:08] yeah
[19:58:12] oo yuck, yeah
[19:58:31] ok i will try to work on that, it will be easiest to add a special done flag for when the partition exists.
[19:58:37] it'd be nice to have even another one
[19:58:45]
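The "less strict" readiness rule floated above (write the _SUCCESS flag as long as loss stays under about 1%) might look something like this. It is a sketch only: the function names are hypothetical, and it assumes the expected and received line counts for an hour are already known, for example from the sequence stats, which is not how the current check is described in this log.

```python
def partition_is_ready(expected_lines, received_lines, max_loss=0.01):
    """Return True if the fraction of missing lines is below max_loss.

    max_loss=0.01 mirrors the "less than 1% loss" idea; the exact
    threshold is an assumption.
    """
    if expected_lines <= 0:
        # Nothing expected means we cannot judge loss; treat as not ready.
        return False
    missing = max(expected_lines - received_lines, 0)
    return missing / expected_lines < max_loss


def daily_job_can_run(hourly_ready):
    """A daily job that waits on all 24 hourly partitions of one day."""
    return len(hourly_ready) == 24 and all(hourly_ready)
```

With a tolerance like this, an hour that is missing only a few lines would still be marked ready, so a daily job could wait on all 24 partitions instead of only on hour=23.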