[17:58:32] brb lunch
[18:03:11] average, milimetric|lunch: https://gist.github.com/ottomata/8477506#file-gistfile1-txt
[18:04:37] wow ottomata, so basically like nobody
[18:04:43] well, a few people
[18:04:57] lol, wth is "Soap qwerty"
[18:05:28] ottomata: you might wanna use OSfamily (the other UDF) for vendor, major, minor. you might get more accurate results
[18:05:31] but pretty awesome
[18:05:48] we could probably eliminate even these if we treated nulls in any fields as nulls in all fields
[18:05:53] or something like that
[18:05:57] yeah, but that is bad news, right?
[18:06:02] oh wait
[18:06:03] no
[18:06:06] sorry got it backwards
[18:06:06] yeah ok
[18:06:08] yep :)
[18:06:14] oo, it would be even more awesome to see the distribution!
[18:06:21] group by count
[18:06:23] eh?
[18:06:49] average, since right now we are kinda talking about using devices, i'd like to keep using it, but hmmm
[18:06:53] average says I can use both
[18:06:57] you think I should use both udfs then?
[18:07:02] yes
[18:07:03] can I do that as is right now, or do I need something extra
[18:08:14] CREATE TEMPORARY FUNCTION ua1 AS 'org.wikimedia.analytics.kraken.hive.OSfamily';
[18:08:33] and use ua1 around line 21
[18:08:55] then replace lines 15,16,17 with data coming from ua1
[18:09:02] and you're done
[18:11:23] you should leave line 14 as is, since dclass is very good at getting the model
[18:12:23] ok
[18:12:23] https://gist.github.com/ottomata/8477506
[18:12:26] does that look ok?
[18:12:43] ah!
[18:12:46] myfuncs are backwards
[18:13:35] ottomata: yeah, but apart from that, it looks good
[18:14:26] dclass is very good at device detection, it has almost 3000 patterns in its DB with the latest packages you're using
[18:17:55] hmm, milimetric|lunch if we are going to puppetize wikimetrics and use it in mediawiki vagrant
[18:18:00] i think we should make it into a submodule
[18:18:08] so it can be used in both places more easily
[18:18:13] that would give it its own repository
[18:18:26] operations/puppet/wikimetrics
[18:23:11] milimetric|lunch: i'm in batcave, have qs for you
[18:23:13] whenever you are ready
[18:40:15] nuria: around?
[18:57:46] nuria nuria nuria
[18:57:57] hola ori
[18:58:12] there's an outage with our network provider in tampa
[18:58:21] so various things are down
[18:58:42] i don't think anything is affecting eventlogging specifically but i wouldn't be surprised if there was something i did not think of
[18:58:53] because there are misc pieces of infrastructure there
[18:59:03] anyways, TL;DR, since there's an outage anyway we might as well do the db migration
[19:00:11] what do you say?
[19:01:40] christian feels like we should not move with adding the column yet
[19:02:15] until we have presented the sanitization strategy to the community
[19:03:08] ah, hm -- he's probably right
[19:04:06] yay, no analytics slaves!
[19:04:11] everyone go crazy and set fire to things.
[19:08:22] actually it would have been such a good opportunity
[19:08:35] risk free as you cannot make things worse ...
[19:12:42] well,
[19:12:45] we could just add the column
[19:12:48] * qchris feel guilty :-)
[19:12:56] but not enable writing anything to it
[19:13:24] so if the plan is substantially revised, we can drop the empty column
[19:13:24] but if not, it's good to go
[19:13:24] Sounds like a good plan to me ori.
[19:14:27] cool. so: - email alert, - !log on #wikimedia-operations, - script?
[19:15:15] eta for fix from provider was '1-2 hours' and that was 40 mins ago, so we should get cracking
[19:21:27] nuria: ?
[19:21:47] really?
[19:21:48] taht sounds great
[19:21:53] *that
[19:22:16] how should i help?
[19:22:40] only that i have to leave my co-working in 30 mins but hey,
[19:22:47] let's live on the edge
[19:22:57] :D
[19:23:41] okay, i can run the script then. do you want to send the emails?
[19:24:11] I would send the e-mail to the event logging subscribers
[19:24:17] plus notify channel
[19:24:46] email analytics too
[19:24:47] I will tell otto before...
[19:27:39] ottomata: ^
[19:28:06] hiii
[19:28:06] wha?
[19:28:53] ok
[19:28:59] how can I help?
[19:29:05] oh boy, big db migration on friday afternoon, right!?
[19:29:08] we sure we want to do this?
[19:29:22] ottomata: yeah, it's not a big deal really
[19:29:28] :))
[19:32:00] is it a long running alter?
[19:32:00] or just real quick
[19:32:00] ?
[19:36:59] just alter
[19:48:31] nuria, halfak: sorry for disconnecting right after the meeting
[19:48:31] I believe we have two outstanding action items
[19:48:31] ok average, milimetric
[19:48:31] check it
[19:48:31] https://gist.github.com/ottomata/8477506
[19:48:31] probably should reorder that
[19:48:31] 1) work on use cases for B1 and B2 (and loop in Mobile) to determine what fields we're talking about
[19:48:32] uhhh
[19:48:32] i'm not sure how to read that
[19:48:32] crap
[19:48:32] that looks weird
[19:48:32] 2) go back to Legal with these use cases so they can help us determine if they should be included in the PII definition
[19:48:32] ok, i will have to come back to that
[19:48:32] hm
[19:48:32] DarTar sounds good
[19:48:32] 3) include some use cases for A (they are more likely to come from you guys than from Research and Data)
[19:48:33] I can kick off (1) with folks in Product + Mobile
[19:48:33] i also have to send luis some examples, i put them here: https://gist.github.com/nuria/8476742
[19:48:33] A use cases is ops but ottomata has talked about those before, DOS attack, basic sanity checking
[19:48:34] well, we need to write them down
[19:48:34] but, that's ok, because we intend to keep some raw data for a limited period of time
[19:48:34] and that's all ops needs
[19:48:34] so ja
[19:48:34] and, that mostly is relevant for webrequest
[19:48:34] more than eventlogging
[19:48:34] ottomata: tnegrin is actually suggesting the possibility that we do sanitization before keeping them raw
[19:48:35] and that's what you guys need to figure out
[19:48:35] not on hadoop
[19:48:35] that's not a good solution initially -- we will retain, then santize
[19:48:35] sanitize
[19:48:36] I thought there were objections about the idea of even storing unsanitized data
[19:48:37] in any case, it would be useful to have more details about the ops use cases
[19:48:37] because I cannot speak for them with Legal
[19:48:37] DarTar: How do you want to proceed with use-case C?
[19:48:37] I think we can sanitize as it goes into hadoop right? ottomata and i were talking about this last time
[19:49:10] yes -- but we don't want to do that
[19:49:20] my original goal was (if possible) to decouple A and B from C because C is really tricky (per Luis)
[19:49:48] so the sooner we have a solution for A and B, the sooner we can move on with an implementation that Legal is ok with
[19:49:53] DarTar, do you want to rework this section? https://office.wikimedia.org/wiki/Analytics/Internal/UserAgentSanitization#Use_of_user_agent
[19:50:11] yay to your prior comment
[19:50:43] DarTar: Sounds good. Shall we keep use case C on hold for now then?
[19:51:25] nuria: yes, but maybe it makes more sense to refer to information extracted from UAs instead of sanitized UA
[19:51:33] halfak: that would be my preference
[19:51:44] Bummer. OK.
[19:52:08] sorry dude, but addressing that one is really a big deal
[19:52:14] is that blocking anything in the short term?
[19:52:34] It's been blocking a project for a while, but I have other projects.
[19:53:57] B is blocking projects including the new VE tests, flow analytics and even the fact that we should change what mobile is collecting
[19:53:59] so I'd like to give it priority over C
[19:54:32] which doesn't mean we should stop investigating the implications of the two options you guys articulated
[19:54:46] That's fine. No arguments.
[19:56:32] given that all use cases in B are based on EventLogging does it make sense for me to hack this public page instead? https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization
[19:56:33] any objection?
[19:57:05] also clarifying the terminology (anonymize vs sanitize or extract)
[19:57:21] per tnegrin, I think it's important
[19:59:57] My vote goes to "extract", but I think that the majority preferred "sanitize".
[20:00:21] Where extract means (Device, OS, Browser, Version).
[20:01:04] I just moved that page to "sanitization"
[20:01:04] https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
[20:01:04] if it does not sound good let me know
[20:04:34] putting together a quick doc to collect requirements, sharing it in a moment with you guys for review
[20:04:44] Yes, because extract is used for processing
[20:04:48] and we are not processing
[20:05:07] rather "removing chuncks"
[20:05:19] *chunks
[20:05:56] So between extract and sanitize i much prefer sanitize
[20:06:02] other options?
[20:06:51] What's the practical difference between extract and summarize with regards to entropy? Also, what remains in the UA after "chunks" are removed?
[20:07:11] (not necessary that you tell me now, but that should probably be in the doc)
[20:08:17] halfak: module storage hard numbers bitte bitte :P
[20:08:33] ori: I sent you an email.
[20:08:39] I think I did anyway.
[20:08:50] when? there was a partial mail outage earlier
[20:08:57] due to tampa fiber cut
[20:09:01] 9:30 PST
[20:09:13] ori@wikimedia.org -- is that the right address?
[20:09:16] yes
[20:09:32] haven't gotten it
[20:09:36] probably outage
[20:09:40] re-send?
[20:09:40] I just forwarded it again.
[20:13:36] halfak
[20:13:44] we can answer as to what we remove
[20:14:02] you can see that here: https://gist.github.com/nuria/8476742
[20:14:17] what remains depends and the answer will be a probabilistic one
[20:14:54] ori: did you get it now?
[20:16:24] yep, reading
[20:16:44] The overall row is weird. I'm looking into that now.
[20:27:28] * ori nods
[20:36:14] ori: Those numbers are good.
[20:37:03] halfak: how could the average effect across mobile and desktop be larger than the average effect on either type of platform?
[20:37:09] i am misreading this, aren't i?
[20:37:58] The issue comes from the bimodal distribution that we get when we combine mobile and desktop together.
[20:39:07] It makes the comparison weak to differences in counts between mobile and desktop.
[20:41:26] Somehow, we have a slightly higher proportion of desktop users in the test condition.
[20:42:07] that is very interesting
[20:43:15] Test: 29.23% mobile vs. Control: 29.33% mobile
[20:43:42] I bet if I resample them down this will be more reasonable. Let me try that quickly.
[20:43:42] so, I thought I'd be able to do this myself but I'm worried I'd misrepresent the results, so could I ask you one more thing -- namely, for a concise single-sentence(ish) statement of the overall result?
[20:43:56] kk
[20:44:14] MS reduces load times overall by about 1/10th of a second.
[20:44:56] This effect is consistent between mobile and desktop.
[20:45:02] typically the unit of measurement for performance is milliseconds, so could i say 139ms instead?
[20:46:11] I'd saw 100ms. I don't believe that the "overall" rows in that table are a fair comparison.
[20:46:35] s/saw/say
[20:47:37] OK, very cool. This is great.
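(Editor's aside on the "overall row" discussion above: a minimal sketch of how a pooled comparison can show a larger effect than either platform does on its own when the mobile/desktop mix differs slightly between conditions. Only the two mobile shares, 29.33% and 29.23%, come from the chat. The mean load times are made-up numbers chosen for illustration, as is the assumption that mobile is the slower platform.)

```python
# Hypothetical mean load times in milliseconds (NOT from the experiment).
# Both platforms are given the same 100 ms improvement on purpose.
MEAN_LOAD_MS = {
    "control": {"desktop": 1000.0, "mobile": 3000.0},
    "test": {"desktop": 900.0, "mobile": 2900.0},
}

# Mobile shares quoted in the chat: control 29.33%, test 29.23%.
MOBILE_SHARE = {"control": 0.2933, "test": 0.2923}


def pooled_mean(condition):
    """Mean load time when mobile and desktop samples are lumped together."""
    share = MOBILE_SHARE[condition]
    means = MEAN_LOAD_MS[condition]
    return (1.0 - share) * means["desktop"] + share * means["mobile"]


def platform_effect(platform):
    """Control minus test mean load time within a single platform."""
    return MEAN_LOAD_MS["control"][platform] - MEAN_LOAD_MS["test"][platform]


pooled_effect = pooled_mean("control") - pooled_mean("test")

print("desktop effect (ms):", platform_effect("desktop"))
print("mobile effect (ms):", platform_effect("mobile"))
print("pooled effect (ms):", round(pooled_effect, 2))
```

With these made-up numbers the pooled effect comes out a little above the 100 ms seen on each platform, purely because the test group has slightly fewer of the slower mobile requests. Resampling both conditions to the same platform mix, as suggested in the chat, would remove that artifact.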
[20:48:36] :)
[20:50:05] faidon mentioned that the Foundation spends a chunk of change on optional network infrastructure that provides benefits of that order of magnitude, so it's not too shabby. I'm really pleased! Thanks very much. We should still do a full-write up, I can help filling in details
[20:50:33] I agree. I'm going to be able to block off time mid next week.
[20:50:34] I think I should provide a more concise technical statement before diving into details
[20:50:51] On the writeup?
[20:50:58] Like a TL;DR:?
[20:51:31] yeah, I re-read the couple of paragraphs of technical context that I added and I thought that they were missing the forest for the trees a bit
[21:20:48] halfak: sorry, I forgot to ask something -- the effect size was esp. pronounced on one popular platform, right? Chrome, IIRC. would it be hard to generate the EV for MS on that?
[21:24:03] halfak: bbiab, but <3 if it's possible to generate that
[22:34:06] ottomata, you around?
[22:34:09] damn
[22:55:11] (PS1) Milimetric: dependency install script, handles httplib [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108160
[22:56:07] Ironholds: stuck on anything?
[22:56:26] milimetric, I have a feature request and don't know who to get it to
[22:56:44] namely: we need to start quoting the fields in our logs
[22:57:04] I'm sat here looking at a single, 6MB field titled 'UA' because some moron browser designer put tabs in his string
[22:57:25] and in the absence of being able to hunt him down and make him browse the internet on an experimental early IE release, we should probably quote ;p
[22:57:29] gotcha
[22:57:29] the thread from earlier
[22:57:37] indeedy. So, who do I need to buy cake to make this happen? ;p
[22:58:09] I think cake is not needed but I'd just shoot tnegrin the request
[22:58:20] makes sense.
[22:58:22] then it'll either be considered a production issue
[22:58:26] and those get done asap
[22:58:31] or not, and those get done less asap
[22:58:39] I mean, if we're shifting to varnishkafka as a storage mechanism I'd hope this'd be solved for by that
[22:58:44] I'm sorry -- there will be no cake
[22:58:57] but I don't know where VK is getting the data, and so have no idea if the problem will just be replicated there, or something.
[22:59:09] can you file a bug?
[22:59:19] happily; bugzilla or RT?
[23:00:19] bugzilla Ironholds
[23:00:26] sweet
[23:00:45] also, varnishkafka is deployed and fully functional on the mobile varnishes
[23:00:55] and it relays stuff via proper json
[23:01:11] so tabs no longer cause problems
[23:01:39] the result of that pipeline is available in the webrequest_mobile table of the wmf database on the hadoop cluster
[23:02:17] so the full firehose will behave the same
[23:02:20] and json == quoted. yay!
[23:02:35] indeed, but until VK has non-mobile data...
[23:03:14] right, bug still applies, definitely
[23:03:48] I'd love to know how erik z reads his files in
[23:04:12] my worry is that if it's in a tab-sensitive way (i'd imagine it'd have to be, since they're the separators) we're looking at a data loss here :/
[23:04:16] anyway. bug filing.
[23:05:50] https://bugzilla.wikimedia.org/show_bug.cgi?id=60184
[23:06:42] Ironholds: I believe Erik discards lines which have more fields than the expected number of fields, but IIRC tabs in ua's are quite uncommon so it should not be a big deal to miss those lines
[23:07:29] drdee, indeed, but still to be avoided (particularly when it.../should/ be a relatively simple fix. I say should because I only know how to solve for it in nginx and I know we use a weird custom udp2log..thing.)
[23:08:15] well you would have to update all the scripts, webstatscollector and udp-filter that parse those log files :)
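(Editor's aside on the tab problem discussed above: a minimal sketch of why an unquoted tab inside a user agent breaks tab-separated parsing while a JSON-encoded record survives. The three-field layout, the field names and the helper functions are hypothetical. They are not the real udp2log or varnishkafka formats, and the strict field-count check only approximates the "discard lines with too many fields" behaviour mentioned in the chat.)

```python
import json

EXPECTED_FIELDS = 3  # hypothetical layout: ip, url, user agent


def to_tsv(ip, url, ua):
    """Write a record as an unquoted tab-separated line."""
    return "\t".join([ip, url, ua])


def parse_tsv_line(line):
    """Split on tabs; return None when the field count is not the expected one."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # the line is dropped, so its data is lost
    return fields


def to_json(ip, url, ua):
    """Write a record as one JSON object; tabs are escaped inside strings."""
    return json.dumps({"ip": ip, "url": url, "ua": ua})


def parse_json_line(line):
    """Recover the three fields from a JSON-encoded record."""
    record = json.loads(line)
    return [record["ip"], record["url"], record["ua"]]


clean_ua = "Mozilla/5.0 (X11; Linux x86_64)"
tabbed_ua = "Weird\tBrowser/1.0"  # made-up user agent containing a literal tab

print(parse_tsv_line(to_tsv("10.0.0.1", "/wiki/Foo", clean_ua)))
print(parse_tsv_line(to_tsv("10.0.0.1", "/wiki/Foo", tabbed_ua)))
print(parse_json_line(to_json("10.0.0.1", "/wiki/Foo", tabbed_ua)))
```

The second call returns None because the embedded tab produces a fourth field, while the JSON round trip keeps the user agent intact. Quoting the fields in the tab-separated format would fix the same thing, but as noted in the chat every consumer that splits on tabs would then need updating.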