[12:39:30] (CR) QChris: [C: -1] Use tsv format when outputting webrequest faulty hosts files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/150963 (owner: Ottomata) [12:40:14] Analytics / Refinery: Story: AnalyticsEng has kafkatee on analytics1003 - https://bugzilla.wikimedia.org/68246#c2 (Andre Klapper) (In reply to Kevin Leduc from comment #1) > re-prioritizing to highest so it gets pulled into next sprint This was marked as highest priority two months ago and no progress... [12:40:17] Analytics / Refinery: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://bugzilla.wikimedia.org/68139#c2 (Andre Klapper) (In reply to Kevin Leduc from comment #1) > reprioritizing to highest so it gets pulled into next sprint This was marked as highest priority two mon... [12:40:17] Analytics / Refinery: Story: AnalyticsEng generates new datafiles using kafkatee - https://bugzilla.wikimedia.org/68247#c2 (Andre Klapper) (In reply to Kevin Leduc from comment #1) > re-prioritizing to highest so it gets pulled into next sprint This was marked as highest priority two months ago and no... [12:41:54] (CR) Ottomata: Use tsv format when outputting webrequest faulty hosts files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/150963 (owner: Ottomata) [12:43:30] Analytics / Refinery: Story: AnalyticsEng generates new datafiles using kafkatee - https://bugzilla.wikimedia.org/68247#c3 (Andrew Otto) There is a bug in kafkatee that is keeping us from moving forward. I have been in touch with Magnus E about it, but he has not yet had time to troubleshoot. [13:27:13] (PS1) Gilles: Generate TSV for versus test running on new labs machine [analytics/multimedia] - https://gerrit.wikimedia.org/r/164062 [13:27:50] (CR) Gilles: [C: 2 V: 2] "Query tested by SSHing the SQL server" [analytics/multimedia] - https://gerrit.wikimedia.org/r/164062 (owner: Gilles) [13:37:43] ottomata, want a more fun challenge? after yesterday's image one? [13:38:47] hmn. wait. I can fix this. [13:43:30] wait, no I can't. bah! [13:50:29] ottomata, how would you go about eliminating, say, 40px- from the start of the filenames that the regex you produced is generating? [13:50:39] note that the 40 is...any number of \ds [14:01:19] Ironholds: do all filenames you are selecting have that pattern? [14:02:02] nope! [14:02:06] just to make it /extra/ fun :P [14:05:06] (PS1) Gilles: Add graph for new versus test machine [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/164068 [14:05:24] (CR) Gilles: [C: 2 V: 2] "Tested on limn locally" [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/164068 (owner: Gilles) [14:29:42] qchris_meeting: FYI: [14:29:49] "The moment the leader moves to another broker, the partition's source of [14:29:49] truth is the new broker's log and other followers truncate their logs to [14:29:49] follow the new leader. So, any unreplicated messages that didn't reach the [14:29:49] new leader are lost. If the old leader rejoins ISR, it will also truncate [14:29:49] its log to follow the new leader's log." [14:33:35] ottomata: That makes perfect sense. [14:33:43] So looks like we're on the right path with the acks [14:34:38] ja, makes total sense, I agree [14:35:25] Also I've been thinking about the timeout settings... [14:35:43] Are they maybe too high? [14:36:40] Because in case of a temporary error, [14:36:51] detection of the error might take ages. [14:37:08] So we try to resend only looooong after the initial temporary error.
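The exchange above concerns varnishkafka/librdkafka settings, but the trade-off generalizes. A minimal sketch against the Kafka 0.8 Java producer API (broker host, topic, and values are hypothetical, for illustration only): acks of -1 makes the leader wait for the full in-sync replica set, so an acknowledged message survives the leader-change truncation quoted above, while the request timeout bounds how long a transient error can go unnoticed before a resend, which is the delay qchris goes on to describe.

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1: wait for all in-sync replicas, so acked messages survive
            // the log truncation that follows a leader change.
            props.put("request.required.acks", "-1");
            // A lower timeout detects transient errors (and triggers resends)
            // sooner, at the cost of false positives on slow cross-DC links.
            props.put("request.timeout.ms", "10000");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("webrequest", "payload"));
            producer.close();
        }
    }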
[14:37:25] Hence buffers are fuller than needed in case of temporary errors. [14:38:04] Somewhat akin to bufferbloat. [14:42:34] qchris, you are referring to topic_request_timeout_ms? [14:42:58] yes [14:43:19] (But I am not sure my line of thought makes sense) [14:43:42] ottomata, qchris: any ideas on why http://metrics.wmflabs.org/ wouldn't resolve for me but it does for Dan [14:43:44] ? [14:44:18] nuria_: wfm [14:44:31] can you resolve other wmflabs.org domain names? [14:45:05] qchris: no, quarry [14:45:06] like for example http://quarry.wmflabs.org/ [14:45:11] doesn't look to work either [14:45:21] right i was trying just that [14:45:30] qchris, yes, that is true, buffers will be larger with a larger timeout. [14:45:43] but, this was increased in order to overcome potential cross DC latencies [14:45:52] especially when we were seeing them from esams [14:45:56] but, we haven't had problems with that recently. [14:45:57] qchris: but i have not changed anything on /etc/hosts or anything [14:45:59] ottomata: ok. thanks. [14:46:26] nuria_: You can resolve other fresh domains that you did not resolve before? [14:46:43] like http://core.servus.at/ [14:46:57] qchris: yes [14:47:12] Ok. Can you resolve outside of your browser? [14:48:04] qchris: with curl yes [14:48:11] i can resolve the one you sent me [14:48:20] Ok. [14:48:31] Did you try closing the browser, and starting it again? [14:48:35] qchris: but not wmf [14:49:08] qchris: yes, I cannot resolve (with curl) [14:49:22] but, qchris, i wonder if we could reduce that timeout [14:49:23] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=&vl=&x=&n=&hreg[]=cp3.%2A&mreg[]=kafka.rdkafka.brokers..%2Artt.avg&gtype=line&glegend=show&aggregate=1 [14:49:27] wmf domains [14:49:29] those are esams rtts [14:49:37] there isn't a max recorded of more than around 3 seconds [14:49:44] (units here are microseconds, pretty sure) [14:50:08] well, i guess there are a few jumps up to 4 or 5 seconds [14:50:10] ottomata: Right. [14:51:31] nuria_: Not sure ... [14:51:45] man ...will ask in labs [14:51:54] nuria_: Can you resolve on other machines in your network? [14:52:09] ok. [14:52:23] Ironholds: [14:52:24] maybe [14:53:08] regexp_extract(uri_path, '/(\d+px-)?([^/]+)$', 2) as uri_file [14:53:21] hhmn [14:53:23] * Ironholds will test! [14:54:10] qchris: no, man what happened to my connection... ains... [14:55:13] ottomata, no such luck :( [14:55:36] qchris: but i cannot resolve wikimetrics in 3g either... puf at least it's not me ... [14:55:36] I think the optional nature of the \d+px- means it's getting ignored if regexp gets a match without relying on it. [14:55:53] hmm, the problem might be the capture. [14:56:08] the (\d+px-)? should mean 0 or 1 of those patterns [14:56:13] the ? [14:56:14] there [14:56:32] hm [14:56:49] hm [14:56:50] try this? [14:56:59] qchris, you may know the answer to this (unrelated) question; if I grab the sampled log file for [date], should it only be going up to 6am on [date]? With the rest presumably in the next day's files? [14:57:00] try it in the capture? [14:57:02] /((\d+px-)?[^/]+)$ [14:57:19] * Ironholds runs [14:57:24] ..the query, that is [14:58:16] nuria_: That's ok. Then I'd just add metrics.wmflabs.org in the /etc/hosts file for now and retry in an hour. [14:58:23] nuria_: Its IP is 208.80.155.156 [14:58:41] nuria_: That way, you can go on working while others fix the issue [14:59:06] qchris: will do thank you [14:59:15] yw [14:59:45] qchris: for a moment i thought my ISP had flagged wmf domains and i was like ...
man not ANOTHER connection problem [14:59:54] :-) [15:00:00] Ironholds: more like 06:30 am. [15:00:12] ottomata, NULLs with ,2), produces the results with Npx- with ,1) [15:00:26] Ironholds: So the file with 20141001 in the filename would cover [15:01:05] 20140930, from 6:30am onwards, to 20141001, at 6:30am? [15:01:06] Ironholds: ~2014-09-30T06:30:00 until ~2014-10-01T06:30:00 [15:01:10] yep [15:01:13] * Ironholds nods [15:01:15] Right. [15:01:19] * qchris nods too [15:01:28] okay, cool! Just trying to confirm that the file for [date] does not contain [most of the requests from date] [15:01:42] so, [date] represents the date capture was ended on, not the date represented by the data. [15:14:11] hmm, strange Ironholds, my tests in java say my regex works, but it doesn't really work in hive [15:14:46] odd [15:15:29] annnyyyyway, i'm getting ready to start the stat1002 upgrade [15:15:46] which means it will go offline! you ready for that? [15:16:44] Ironholds: ^ [15:16:57] err [15:16:59] * Ironholds thinks [15:17:04] let's go with "yes" [15:19:20] (PS1) Milimetric: Add metric definition links [analytics/dashiki] - https://gerrit.wikimedia.org/r/164083 [15:20:18] (CR) Milimetric: [C: 2 V: 2] Add metric definition links [analytics/dashiki] - https://gerrit.wikimedia.org/r/164083 (owner: Milimetric) [15:20:50] nuria_: self merged my stuff. you're unable to test the clicking fix because of the DNS issues? [15:21:17] milimetric: no i can test it via /etc/hosts cause i can ping the ip [15:21:27] qchris gave the ip number [15:21:31] oh cool [15:21:47] but I think it needs some change [15:21:54] oh i guess you can also test with the stub files, 'cause actually grabbing the data is not important - just the config [15:22:13] oh ok, let me know when you're ready to talk about it [15:22:34] the ko.bindingHandlers.projectAutocomplete.mouseDownNotImportant is taking three values, undefined, null and false/true [15:22:43] seems a little convoluted [15:23:23] what about? [15:23:27] (PS1) Gergő Tisza: Update schema revision number for NavigationTiming [analytics/multimedia] - https://gerrit.wikimedia.org/r/164084 [15:23:39] // set up a single mousedown / mouseup handler that informs the blur handler above [15:23:39] if (mouseDownNotRelevant) { [15:23:39] $(document).on('mousedown', function (event) { [15:23:39] mouseDownNotRelevant = [15:23:39] $(event.target).closest('.tt-suggestion').length === 0; [15:23:40] }); [15:23:40] $(document).on('mouseup', function () { [15:23:41] mouseDownNotRelevant = false; [15:23:41] }); [15:24:53] ^milimetric [15:25:05] DUHHH [15:25:06] Ironholds: [15:25:11] double escape needed for java regex [15:25:22] '/(\\d+px-)?([^/]+)$' [15:25:44] huh! [15:25:46] well, see [15:25:48] funny story [15:25:57] while I could test that, the machine with the hive client on is...*grins* [15:26:01] haha [15:26:04] it's not gone yet! [15:26:04] do it now [15:26:05] nuria_: the reason for 'undefined' is so we know not to create the event handler twice [15:26:19] Ironholds: BTW, you are the one who wants this upgrade anyway! :) [15:26:39] and mouseDownNotRelevant = false means there was a mouse down but it's not relevant [15:26:49] = null means that this doesn't apply [15:27:01] Ironholds: as soon as you test that, i'm going to start the upgrade :) [15:27:23] milimetric, sorry, if (mouseDownNotRelevant==null) { [15:27:25] in essence, it's weird but all four states mean something. I'm fine with changing them to {'not initialized', 'not applicable', true, false} [15:27:31] ottomata, fair!
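For reference, the escaping issue ottomata lands on just above can be reproduced outside Hive. Hive single-quoted string literals consume one level of backslash escaping, exactly as Java string literals do, so the pattern must be written with \\d to reach the regex engine as \d. A minimal, self-contained Java check of the pattern (the thumbnail paths are hypothetical examples):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PxPrefixDemo {
        // "\\d" in Java source is the regex token \d; the same doubling is
        // needed inside the Hive query's single-quoted literal.
        private static final Pattern FILE = Pattern.compile("/(\\d+px-)?([^/]+)$");

        public static void main(String[] args) {
            String[] paths = {
                "/wikipedia/commons/thumb/a/ab/Foo.jpg/40px-Foo.jpg", // thumbnail
                "/wikipedia/commons/a/ab/Foo.jpg"                     // original
            };
            for (String path : paths) {
                Matcher m = FILE.matcher(path);
                if (m.find()) {
                    // Group 2 is the filename with any NNpx- prefix stripped;
                    // both paths print "Foo.jpg".
                    System.out.println(m.group(2));
                }
            }
        }
    }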
[15:27:44] and mouseDownNotRelevant: null as part of [15:28:13] ko.bindingHandlers.projectAutocomplete = { [15:28:13] mouseDownNotRelevant: null, [15:28:44] milimetric, you only need a lock, plus true/false, right? [15:29:17] milimetric: need to get going for a doctor appointment [15:29:32] nuria_: k, we'll talk when you get back [15:29:47] (CR) Nuria: Fix nondeterministic project selector (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/163892 (https://bugzilla.wikimedia.org/71333) (owner: Milimetric) [15:31:05] ottomata, no dice [15:31:07] run the upgrade [15:31:56] PSSHHH [15:32:06] it worked for me! in my test query that didn't actually select any data! [15:32:12] but ok, upgrade starting [15:32:19] Ironholds: you won't actually be kicked off for a while [15:32:22] so we can probably keep trying [15:32:49] fair! But, I should grab lunch ;p [15:34:00] ok [15:34:03] Ironholds: [15:34:06] gimme the query you are using to test [15:34:12] totally [15:34:46] * Ironholds gists [15:35:47] ottomata, https://gist.github.com/Ironholds/a68d660fe4b2e1b65284 [15:39:44] Ironholds: [15:39:45] https://gist.github.com/Ironholds/a68d660fe4b2e1b65284 [15:39:46] it works [15:39:47] see comment [15:40:01] aha [15:40:03] danke schoen! [15:46:30] Ironholds: ! who told you to call me MEESTAR otto?! [15:46:50] MISHTER OTTO! [15:47:22] ottomata, given that I hired all of Rachel's employees, we do talk occasionally. [15:47:24] ;p [15:47:35] haha [16:15:40] Hey nuria_, I'd like to move the sendBeacon meeting earlier in the day on Friday. Would 11AM PDT be OK? [16:16:07] It looks like your calendar is open. [16:16:51] halfak: she's away for a couple hours [16:16:56] Thanks milimetric [16:20:30] Ironholds: stat1002 upgraded to Trusty. will you test that it has what you want? [16:20:33] newer R? [16:20:42] can't remember if I have to manually upgrade it. [16:35:04] (PS1) QChris: Document agreement on HiveQL filenames starting in a verb [analytics/refinery] - https://gerrit.wikimedia.org/r/164102 [16:39:59] ottomata: Does that mean you're done with upgrading stat1002 and you're eager to discuss pagecount names? [16:40:16] yes! [16:40:29] Awesome [16:40:39] So I again went over the names from 2014-09-25 [16:40:45] and I guess I like http://dumps.wikimedia.org/other/pagecounts-all-sites [16:40:48] the most. [16:40:55] (CR) Ottomata: "EhhHHHhhh, I'd rather this not be a hard and fast rule, but just a convention that should be followed if it makes sense." [analytics/refinery] - https://gerrit.wikimedia.org/r/164102 (owner: QChris) [16:41:11] hm [16:41:21] ottomata: You're too fast with seeing changes :-/ [16:41:38] (CR) QChris: "See ottomatas comment from 2014-09-26 on" [analytics/refinery] - https://gerrit.wikimedia.org/r/164102 (owner: QChris) [16:42:21] I so took care to not post my comment on that change, as it would have pinged you [16:42:40] Mhmm ... naming of pagecounts first :-) [16:42:54] yup [16:43:03] pagecounts-all-sites, eh? [16:43:03] hm [16:43:07] Not sure. [16:43:10] projectcounts-all-sites [16:43:12] yeah [16:43:13] hm [16:43:15] It's just what I liked best from the last brainbounce. [16:43:23] What do you prefer/suggest? [16:44:10] thinking.. [16:44:45] Logs from previous discussion are at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140925.txt [16:44:53] did we talk about calling this webstats at all? [16:44:54] Starting ~21:18:08 [16:45:04] Not really. [16:45:12] You said that the dataset is called webstats.
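An aside on the autocomplete states nuria and milimetric discuss above: the real code is the Knockout binding in Dashiki, but the four states milimetric enumerates ('not initialized', 'not applicable', true, false) can be sketched as a small Java enum to make the design explicit. Names and the true/false mapping below are illustrative, paraphrasing his description rather than quoting the actual code:

    // Hypothetical restatement of mouseDownNotRelevant's four values.
    enum MouseDownState {
        NOT_INITIALIZED, // 'undefined': handlers not registered yet; register exactly once
        NOT_APPLICABLE,  // 'null': no mousedown state applies, blur proceeds normally
        RELEVANT,        // mousedown hit a .tt-suggestion, the blur handler should wait
        NOT_RELEVANT     // mousedown landed elsewhere in the page
    }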
[16:45:26] pagecounts-webstats [16:45:26] projectcount-webstats [16:45:26] ? [16:45:26] I did not push back, because the dataset will go away sooner or later anyways. [16:45:38] webstats-pagecounts? [16:46:06] or, maybe if we make that the public name, it will be confusing to the public how the data is generated [16:46:08] But that's not really in spirit of the existing pagecounts-raw and pagecounts-ez [16:46:18] pagecount-webstats isn't? [16:47:01] Oh ... as you also said projectcounts-webstats, I thought you called out the filenames within that directory. [16:47:15] pagecounts-webstats is in the spirit. [16:47:29] But pagecounts-webstats sounds confusing to me. [16:47:33] do you like pagecounts-all-sites better than pagecounts-all? [16:47:55] A bit, because it puts the "all" in context. [16:48:05] But still ..longer [16:48:21] s/longer/-all-sites is longer than -all/ [16:48:46] So I guess I'd agree to pagecounts-all too. [16:49:14] But then ... "all" somehow implicitly means "all pageviews" [16:49:32] But that in some sense holds less true than "all-sites". [16:49:45] Because action=mobileview does not get counted. [16:49:59] aye [16:50:02] hm [16:50:35] pagecounts-bikeshed [16:50:41] guys can we not have a description that explains this a bit better than a couple of words after a hyphen? [16:51:06] so, just to confirm, this is the name for the temporary solution to webstatscollector lacking the mobile site? [16:51:24] pagecounts-extending-old-known-broken-webstatscollector-pageview-definition-to-all-sites-as-a-stop-gap-measure [16:51:26] we really don't need to spend an extended period of time debating the name. [16:51:46] Ironholds: It has all sites. desktop, mobile, and zero. [16:51:59] okay. But the important bit in my comment was: temporary. [16:52:47] like: unless we fail horribly at our jobs, this will probably be nuked from orbit in a matter of months. [16:52:52] Its name is not a big deal. [16:53:18] Ironholds: qchris and I can debate this name as long as we well like, thank you very much! [16:53:31] names are important and they stick and become part of our vocabulary [16:53:54] but all that's being discussed, is what you add on to the end of [name] to explain to people what it does. [16:53:59] What it does is implementation-specific. [16:54:22] unless producing a solid PV system is something we fail to do, the implementation-specific stuff will no longer be needed. [16:54:28] shhhhhh [16:54:32] naming is fun [16:54:34] :-) [16:54:46] And it helps getting clarity about things. [16:54:53] and yes, you guys are absolutely free to debate whatever you want for however long you want, but I have a PV proposal that needs feedback and commentary and everyone in product pinging me weekly going "so how about them pageviews"? [16:55:05] so, while I cannot do more than politely ask you guys to stop bikeshedding... ;p [16:55:35] (apologies for being grumpy) [16:55:48] Ironholds: I saw your comments from yesterday. I'll respond to them by today. [16:57:52] Would pagecounts-stop-gap work? [16:58:00] After all ... it is a stop-gap measure. [16:58:00] qchris, i am coming around to pagecounts-all-sites [16:58:13] * qchris is silent then :-) [16:58:36] pagecounts-omin [16:58:36] haha [16:58:38] omni* [16:58:43] :-D [16:59:12] pagecounts-stop-gap would express that it is a step forward, but that we intend to improve upon it. [16:59:19] qchris, thanks. Again, sorry for being annoyed. It's been a really stressful week, which is not an excuse for taking it out on you guys.
[16:59:42] * qchris hugs Ironholds [17:00:24] qchris: let's go with pagecounts-all-sites [17:00:26] i'm cool with it [17:00:58] Ok. [17:01:12] pagecounts-all-sites it is :-) [17:01:27] Any vetos from the others? [17:01:40] 3,2,1 .... you've had your chance. [17:02:35] ottomata: I am still missing the final "ok" from legal. So please do not yet rsync the files over. [17:04:42] k [17:05:20] Ha ... just in time. Legal just sent an email saying "That's fine to release." [17:05:22] :-) [17:06:14] oh, awesome [17:06:16] shall we do it then? [17:06:39] Totally! [17:06:59] I am just sending them a "thank you" message. [17:13:02] ottomata, quick question [17:13:08] wait, hangon [17:13:11] I know the answer, I think [17:21:30] qchris + ottomata: congratulations! naming + legal approval coinciding is a good omen [17:22:10] Hahaha. With that coincidence ... I am only waiting to see something explode any minute :-) [17:41:35] qchris: can you write a little README file and stick it at /wmf/data/archive/webstats/? [17:41:38] that way it will get rsynced over [17:41:58] just something describing the dataset [17:42:02] Sure. [17:42:10] I guess in the spirit of the other pagecounts-files [17:42:16] Plain txt? [17:42:19] yup [17:42:23] Cool [17:42:28] or html i suppose, but ja plaintext i think is better [17:42:36] README.txt [17:42:36] :) [17:53:38] ottomata: yurikR needs your help, he just mentioned it in scrum of scrums [17:54:13] also, editing team needs support from ops: https://wikimedia.mingle.thoughtworks.com/projects/scrum_of_scrums/cards/142 [17:54:28] and services team is not sure who to ask for help on https://wikimedia.mingle.thoughtworks.com/projects/scrum_of_scrums/cards/144 [17:54:32] ottomata: ^ [17:55:03] ACK [17:55:05] I missed SoS [17:55:21] oh i thought you were busy 'cause you never miss it [17:55:29] sorry, should've pinged you [17:55:44] AHhhhhwwwww crap crackers [17:55:47] how'd I miss it! [17:55:57] i chose not to go to a cafe so I could do it from home [17:58:21] I don't even know how to contact just those folks! [17:58:22] agh! [17:58:38] crap crap crap [18:07:27] ottomata: who do you need to contact, sorry [18:08:05] s'ok, i just emailed the engineers list, where SoS emails seem to go :) [18:10:08] ottomata, any chance you can use your voodoo ops powers to get RT 8434 bumped? [18:12:11] CCed sean and bumped [18:13:22] ta! [18:41:56] headed to a cafe, back in a bit [18:43:34] (PS1) Milimetric: Build for metrics meeting [analytics/dashiki] - https://gerrit.wikimedia.org/r/164135 [18:43:46] (CR) Milimetric: [C: 2 V: 2] Build for metrics meeting [analytics/dashiki] - https://gerrit.wikimedia.org/r/164135 (owner: Milimetric) [18:49:53] (PS4) Milimetric: Add Rolling Recurring Old Active Editor [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161521 (https://bugzilla.wikimedia.org/69569) [18:50:28] * milimetric afk for two hours [19:29:59] Analytics / Tech community metrics: Allow contributors to update their own details in tech metrics directly - https://bugzilla.wikimedia.org/58585#c31 (Jicksy) Alvaro, Sorry for the delay in returning. Some stufs held me up- now I'm free of it. Hope, I am not very late. I deployed Sarvesh's code on... [19:48:37] milimetric|afk: yt? [19:50:05] had to look it up… afk stands for “away from keyboard” [19:53:29] hey -- can you check to see if the help links work for you on the latest release? [20:03:40] kevinator: ~1 hour ago, milimetric said to be away for 2 hours. So he should be back in ~1 hour. [20:04:08] ok thanks.
Toby came to my desk and I answered his question. [20:07:21] ottomata: Since I did not like the stuff that I put in the README.txt too much, I created [20:07:23] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites [20:07:30] Which makes the info way easier to edit. [20:07:44] Do you think it's ok if the README.txt basically just links to that, [20:08:00] or should we provide real content in the README.txt [20:08:45] Hm, qchris, I kinda think the content should be in README.txt, but you've got some fancy wikimarkup there! [20:09:09] dunno, maybe its ok to link [20:09:19] there's a .mediawiki or something [20:09:24] if you're looking for github rendering [20:09:47] What about, I just say at the top of the README that up-to-date info is at the wikipage, and keep the static version in the README.txt [20:10:18] JetLaggedPanda: Thanks! But we're looking for basic lighttp :-) [20:10:47] ah, ok :) I was just randomly flaffing around :) [20:11:07] * JetLaggedPanda suggests setting up a mediawiki install, and making each dump file be a page in a special namespace [20:11:08] * qchris grabs the dictionary to look up flaffing [20:11:17] Hahahaha :-D [20:14:05] qchris: I love the wikipage Pagecounts-all-sites … super useful [20:14:42] qchris: however I think the description of the .m domain is misleading. [20:15:07] isn’t it m.*.org abbreviated as .m? not wikimedia.org? [20:15:19] Both. [20:15:30] That's where the webstatscollector legacy comes in. [20:15:55] oh, so it’s not just for mobile sites? [20:16:15] Sadly enough, it isn't. [20:16:35] Webstatscollector uses it for whitelisted wikimedia domains [20:16:41] ('commons', 'meta', 'incubator', 'species', 'strategy', 'outreach', 'usability', 'quality') [20:16:47] So we have to continue to do so. [20:17:14] Luckily, it is not ambiguous, as no language is called for example 'strategy' [20:17:30] Hence 'strategy.m' is strategy.wikimedia.org [20:17:37] 'en.m' is 'en.m.wikipedia.org' [20:18:06] That's not totally straightforward. [20:18:46] But it's required to make backwards-compatibility, and easy consumption work simultaneously. [20:19:13] Suggestions welcome to improve that. [20:19:24] yeah, so I wanted to use SQL for all mobile wikipedias, I need to exclude some cases [20:19:36] er “if I wanted to” [20:20:12] Yup. That query is the worst case. [20:20:23] You need to exclude 'commons', 'meta', 'incubator', 'species', 'strategy', 'outreach', 'usability', 'quality'. [20:20:47] However, that approach makes all other use cases way easier. [20:21:15] could you add that to the notes in the page? I think it would be useful. [20:21:44] It's a wiki ;-) [20:21:53] But I'll add it. [20:22:18] thank you :-) [20:35:14] ottomata, did we find a functioning regex in the end? [20:35:51] wait, we did. It was on gist. [20:38:03] yup, in the comment of your gist [20:39:16] qchris, I'm cool with that; adding the text but linking [20:39:19] actually I don't really care :) [20:39:42] Ok. [20:39:45] Cool. [20:59:55] (PS12) Nuria: Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) [21:12:44] Hey nuria, I'd like to move the sendBeacon meeting earlier in the day on Friday. Would 11AM PDT be OK? [21:13:23] much better actually halfak, i was going to have to reschedule [21:13:32] Woo! [21:13:50] :) [21:54:30] ottomata: Just to make sure I make good use of tomorrow morning ... [21:54:37] yes sir! :) [21:54:46] Is there anything left to prepare for pagecounts-all-sites?
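A sketch, in Java for consistency with the rest of this log's code, of the '.m' abbreviation rules qchris walks through above (the authoritative description is the Analytics/Pagecounts-all-sites wikitech page; class and method names are hypothetical). The wrinkle: '.m' normally marks a mobile site, except for the eight whitelisted wikimedia.org projects, where the same suffix denotes the desktop site, which is why a query for all mobile wikipedias has to exclude those prefixes.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DotMResolver {
        // The eight webstatscollector whitelist entries named above.
        private static final Set<String> WIKIMEDIA_PROJECTS = new HashSet<String>(
            Arrays.asList("commons", "meta", "incubator", "species",
                          "strategy", "outreach", "usability", "quality"));

        // Only the ".m" case from the discussion is sketched; other
        // webstatscollector suffixes are out of scope here.
        public static String resolveDotM(String abbreviated) {
            if (!abbreviated.endsWith(".m")) {
                throw new IllegalArgumentException("expected a '.m' abbreviation");
            }
            String prefix = abbreviated.substring(0, abbreviated.length() - 2);
            return WIKIMEDIA_PROJECTS.contains(prefix)
                ? prefix + ".wikimedia.org"     // e.g. "strategy.m"
                : prefix + ".m.wikipedia.org";  // e.g. "en.m"
        }
    }

resolveDotM("en.m") yields en.m.wikipedia.org while resolveDotM("strategy.m") yields strategy.wikimedia.org, matching the two examples in the discussion.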
[21:55:03] Just apergos chiming in on the patch [21:55:10] hm, no don't think so, I see your final nits, will fix them before merging [21:55:13] yes, pretty much just that [21:55:14] A first rsync run and we're done? [21:55:17] yup [21:55:36] Meh. Ignore my nits. I voted CR+1. [21:55:42] Ok. Thanks. [22:39:45] qchris: still around? [22:39:51] yup. [22:40:04] what's up? [22:40:12] generic java q for you [22:40:59] https://gist.github.com/ottomata/cd195c36e7e574877be3 [22:41:17] evaluate(org.apache.hadoop.io.Text) in org.wikimedia.analytics.refinery.hive.GeocodeCountryUDF cannot be applied to (org.apache.hadoop.io.Text,org.apache.hadoop.io.Text) [22:41:30] trying to overload evaluate() here [22:42:23] there's something really simple i'm doing wrong... [22:44:21] evalulate has two ls [22:44:33] There is a typo in line 40. [22:44:37] ottomata: ^ [22:44:59] !!!! [22:45:00] :D [22:45:02] thank you. [22:45:05] yw [22:45:10] phew! [22:45:27] i would have been staring at that for at least another 30 mins if I hadn't asked your keen eyes to take a look! [22:45:42] No you wouldn't! [22:46:00] But seeing that code ... do we want to have xff in the geocoding? [22:46:22] I'd much rather have a ResolveXffUDF (which [22:46:38] handles the X-Forwarded-For resolving and spits out the IP we care about, [22:46:40] agreed. [22:46:52] and then a ... ok :-) [22:46:57] but, the UDF is still likely going to have to take multiple args, like [22:46:59] ip, xff [22:47:15] and, it will know how to take those and return the client IP [22:47:22] which we'd then pass to geocode udf [22:47:28] (i'm just playing around right now) [22:47:45] Yes sure ... but I can see use in that code :-) [22:47:47] i'm using diederik's kraken geocoding logic right now, not sure if we will want to start with that, or scrap it and start with something new [22:47:50] haha [22:47:55] no! [22:48:00] well, yes. [22:48:12] but, jajajaaj [22:48:16] just playing right now! [22:48:21] GEEZ GET OFF MY BACK [22:48:22] That code had a few issues with which databases it used IIRC. [22:48:28] oh yeah, totally [22:48:30] * qchris shuts up. [22:48:36] SHUT UP QCHRIS!!!!! [22:48:38] haha [22:48:42] no, keep talking, you saw my typo! [22:48:53] but, that code does some tricky stuff with IPv6 and continents [22:48:56] which we might want to keep around [22:48:57] we will see [22:49:01] Sure. [22:54:58] Ironholds: still around? [22:55:21] ottomata, whatever you've discovered I deny participating in it or being aware of it [22:55:32] how was I to know Oracle salesmen were allergic to thermite? [22:55:39] I'm not that kind of scientist, dammit! [22:55:51] (also yes, I'm here) [22:56:04] (PS1) Nuria: Cleaning up before doing anything in the build [analytics/dashiki] - https://gerrit.wikimedia.org/r/164254 [22:57:09] haha [22:58:15] Ironholds: https://gist.github.com/ottomata/8ff13399c6b2a6acc0cf [22:58:17] try that on stat1002 [22:59:20] in hive, presumably ;p [22:59:25] yup [22:59:38] ytics.refinery.hive.GeocodeCountryUDF'; [22:59:38] FAILED: Class org.wikimedia.analytics.refinery.hive.GeocodeCountryUDF not found [22:59:38] FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask [23:00:41] ottomata, ^
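The compiler error pasted at 22:41 follows directly from the typo qchris spots at 22:44: with the two-argument method misspelled "evalulate", javac can only resolve the two-argument call against the lone one-argument evaluate, and rejects it. A minimal sketch of the intended overload pattern for a simple Hive UDF (placeholder body, not the actual refinery source):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class GeocodeCountryUDF extends UDF {
        // One-argument form: geocode the ip field directly.
        public Text evaluate(Text ip) {
            return evaluate(ip, null);
        }

        // Two-argument form: consult X-Forwarded-For first. Both methods
        // must be spelled "evaluate" exactly for the overload to exist.
        public Text evaluate(Text ip, Text xff) {
            // ... pick the client IP, look up its country ... (placeholder)
            return ip;
        }
    }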
[23:03:10] hum [23:03:30] yep, running [23:03:32] holy sweet jesus [23:03:37] Worked! [23:03:42] \o/ [23:03:44] it's like god looked down and saw I was utterly miserable and without hope [23:03:46] and sent me ottomata [23:03:54] well done, god. Well done. [23:04:11] k, pretty sure that is mostly good, even though the implementation is a little hacky for my taste [23:04:26] i will not vouch for the x_forwarded_for logic though [23:04:48] if given a non empty xff, it will return the first IP it finds that is not an internal network IP. [23:05:15] if all xff IPs are internal, it will return the regular ip field [23:05:23] you can use it without xff if you like, Ironholds [23:05:28] just don't pass it the second arg [23:06:57] ottomata, how did you work out the internal-ness? [23:07:22] oh, just internal networks in general, not ours [23:07:36] oh, and I only checked IPv4. [23:07:37] hm [23:07:48] https://en.wikipedia.org/wiki/Private_network#Private_IPv4_address_spaces [23:08:07] ha, actually, think i'm not checking 172s [23:08:47] aha [23:08:55] hmn. So, problem, then [23:09:10] ip field has our internal ip. XFF has an external IP. [23:09:15] Ironholds: this crap: [23:09:15] https://gist.github.com/ottomata/271765fe5a5898b72bcb [23:09:16] I'll call it a day, you Java haxxors! [23:09:20] See you tomorrow. [23:09:32] what's the outcome? [23:09:37] aha [23:09:38] ooh [23:09:44] Ironholds: if XFF is not "-", the first thing that is not an internal network there will win [23:09:52] so, IP will win, or XFF will? [23:10:01] if IP, we need to build in recognition for our machines. [23:10:09] XFF will win if it has a non internal IP [23:10:30] awesome [23:10:34] that sounds perfect :) [23:10:38] ooh. Proposed amendment? [23:10:48] haha, you probably won't get it today :) [23:10:51] https://github.com/Ironholds/WMUtils/blob/master/R/geo_country.R#L42 [23:10:53] it's very simple [23:11:01] ah [23:11:04] i think we might cover that... [23:11:05] TL;DR maxmind outputs certain things that are not actually ISO codes that resolve to country level [23:11:18] they may need to be handled to avoid grief further down the line. [23:11:35] basically using this code [23:11:35] https://github.com/wikimedia/kraken/blob/master/kraken-generic/src/main/java/org/wikimedia/analytics/kraken/geo/GeoIpLookup.java#L107 [23:11:48] hmm, no [23:11:51] those are continent [23:12:39] 109 to 111 is perfect [23:13:01] but I worry about resolving to continents. It's going to require just as much hackery on the research end, to make the dataset work at country level, as not sanitising the results. [23:13:23] Ironholds: that work has mostly already been done [23:13:25] a long time ago [23:13:38] https://github.com/wikimedia/kraken/blob/master/kraken-generic/src/main/resources/country-codes.json [23:13:50] ow my eyes [23:13:56] hha [23:13:59] still; cool! [23:15:44] fifa? [23:17:24] I wouldn't trust their data [23:17:30] people could pay them to make it say whatever they want. 
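A hedged sketch of the X-Forwarded-For rule ottomata describes above: the first address in the header that is not on an internal network wins, otherwise the regular ip field is used. It also includes the 172.16/12 private range he notices his version is not checking. Class and method names are illustrative, and like the original it only inspects IPv4:

    public class ClientIpResolver {
        // RFC 1918 private IPv4 ranges: 10/8, 172.16/12, 192.168/16.
        static boolean isPrivateIpv4(String ip) {
            String[] octets = ip.split("\\.");
            if (octets.length != 4) return false;
            try {
                int a = Integer.parseInt(octets[0]);
                int b = Integer.parseInt(octets[1]);
                return a == 10
                    || (a == 172 && b >= 16 && b <= 31)
                    || (a == 192 && b == 168);
            } catch (NumberFormatException e) {
                return false; // not a plain IPv4 literal; treat as external
            }
        }

        // First non-private address in XFF wins; if the header is absent
        // ("-") or every hop is internal, fall back to the ip field.
        static String resolve(String ip, String xff) {
            if (xff == null || xff.isEmpty() || xff.equals("-")) return ip;
            for (String hop : xff.split(",")) {
                String candidate = hop.trim();
                if (!candidate.isEmpty() && !isPrivateIpv4(candidate)) {
                    return candidate;
                }
            }
            return ip;
        }
    }

For example, resolve("10.64.0.123", "10.0.0.5, 198.51.100.7") returns 198.51.100.7, while an all-internal header falls back to the ip argument, as described at 23:05.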
[23:22:13] (PS1) Ottomata: [WIP] Import GeoIpLookup code from Kraken repository, add Hive UDF to geocode country [analytics/refinery/source] (otto-geo) - https://gerrit.wikimedia.org/r/164264 [23:22:51] (PS1) Ottomata: [WIP] Import GeoIpLookup code from Kraken repository, add Hive UDF to geocode country [analytics/refinery/source] - https://gerrit.wikimedia.org/r/164266 [23:23:22] (Abandoned) Ottomata: [WIP] Import GeoIpLookup code from Kraken repository, add Hive UDF to geocode country [analytics/refinery/source] - https://gerrit.wikimedia.org/r/164266 (owner: Ottomata)
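On the MaxMind caveat Ironholds raises near the end of the day: legacy GeoIP country databases emit pseudo-codes that are not ISO 3166-1 country codes (A1 anonymous proxy, A2 satellite provider, EU and AP continent buckets, O1 other), and his R helper collapses them before country-level analysis. A sketch of the equivalent normalization a geocoding UDF could apply; the mapping and the "Unknown" sentinel are illustrative, not what the patches above ship:

    import java.util.HashMap;
    import java.util.Map;

    public class CountryCodeNormalizer {
        // MaxMind pseudo-codes that do not resolve to a real country.
        private static final Map<String, String> PSEUDO = new HashMap<String, String>();
        static {
            PSEUDO.put("A1", "Unknown"); // anonymous proxy
            PSEUDO.put("A2", "Unknown"); // satellite provider
            PSEUDO.put("EU", "Unknown"); // Europe, country unspecified
            PSEUDO.put("AP", "Unknown"); // Asia/Pacific, country unspecified
            PSEUDO.put("O1", "Unknown"); // other
        }

        public static String normalize(String code) {
            if (code == null || code.isEmpty() || "--".equals(code)) {
                return "Unknown";
            }
            String mapped = PSEUDO.get(code);
            return mapped != null ? mapped : code;
        }
    }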