[00:00:28] drdee: thanks. The tar file doesn't seem to have any docs about the C API but has a test directory with some example code; I'll look at those ...
[00:00:53] ok
[00:11:33] drdee: What is the expected combination of options for geocoding ? I notice that the config file on oxygen does not use geocoding at all.
[00:11:57] -g -b country
[00:12:01] (i believe)
[00:12:10] try man udp-filter
[00:14:12] Stefan used "-g -b country" for his profiling but for this case, some important parts of the code are skipped due to the commented out code that we briefly discussed a few days ago: //params[GEO_FILTER] = true;
[00:14:53] git blame shows this (as Stefan discovered): 1ca09843 (Diederik 2012-03-15 11:28:47 -0700 1229) //params[GEO_FILTER] = 1;
[00:15:08] So all roads lead back to you :-)
[00:16:08] params[GEO_FILTER] is only set if the -c option is provided.
[00:20:01] mmmmmm
[00:20:12] that's a very long time ago
[00:20:50] true.
[00:21:15] Are we currently using it in production ?
[00:21:26] geocoding I mean.
[00:22:00] yes
[00:22:05] although...
[00:22:06] maybe not
[00:22:12] check puppet :)
[00:23:00] there are only 3 machines doing this stuff right ? emery, oxygen, and ?
[00:24:57] I'm seeing this on emery: pipe 10 /usr/bin/udp-filter -F '\t' -p _NARA_ -g -m -b country ....
[00:26:20] which is also confusing since the -m option specifies an alternative path to the db and so _requires_ an argument ! So maybe it is running some older version of the code ?
[00:28:12] you got to ask ottomata :(
[00:28:28] the final machine that runs this is gadolinium
[00:31:19] That machine, like oxygen, does not use -g or -b
[00:32:41] mmmm, it almost sounds like some dev work :(
[00:34:05] OK, I'll send email with my thoughts about this.
[00:40:27] drdee: one more question: we have both GeoIPv6.dat and GeoIP.dat under /usr/share but the code seems to use only one; do you know if we support geocoding of IPv6 ?
[00:40:50] it should support it, not sure if we have fully implemented it
[00:41:10] ok
[00:41:24] ottomata: are you around ?
[04:25:17] New review: Ottomata; "Maybe I'm reading this wrong, but can we but this in scripts/? (if it isn't already there.)" [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60930
[04:57:50] New patchset: Stefan.petrea; "Fixed bugs, cleanup, added info to data.json" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/60963
[04:59:22] New patchset: Stefan.petrea; "Fixed bugs, cleanup, added info to data.json" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/60963
[05:01:43] New patchset: Stefan.petrea; "Fixed bugs, cleanup, added info to data.json" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/60963
[05:07:18] New patchset: Stefan.petrea; "Fixed bugs, cleanup, added info to data.json" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/60963
[05:08:28] Change merged: Stefan.petrea; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/60963
[06:18:52] New patchset: Diederik; "Added simple script to create admin account through CLI" [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60930
[06:56:08] New review: Diederik; "Ok." [analytics/E3Analysis] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/60930
[06:56:09] Change merged: Diederik; [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60930
[06:57:20] New patchset: Diederik; "Use os.path.join instead of string concatenation to create paths. This is required to make the user_metrics puppet module work" [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60966
[08:24:36] average: hey :-)
[08:41:08] hashar: hi
[08:41:43] average: so wikistats
[08:41:59] average: I have noticed there is a puppet class for it. So I guess it is installed via that puppet class
[08:42:06] but I could not find the perl module dependencies there
[08:43:44] hashar: well, what class is that ?
[08:43:53] hashar: did you look at the gerrit changeset I sent in the e-mail ?
[08:44:00] yeah
[08:44:00] all deps are mentioned in the gerrit changeset
[08:44:47] so the packages are installed for the continuous integration server
[08:44:57] in a class I want to phase out contint::packages
[08:45:07] I would like to move those dependencies to the wikistats class instead :D
[08:45:18] and then include the new wikistats class from contint
[08:46:11] need to look at the wikistats class and talk about it with ops I guess
[08:49:34] average: I will amend the change
[08:53:45] no problem
[08:53:50] hashar: thanks
[08:56:04] average: done at https://gerrit.wikimedia.org/r/#/c/60965/2 :)
[08:57:23] average: you will need someone from the ops team to review / merge that change :D
[12:14:48] morning everyone
[12:16:06] milimetric: morning
[12:16:17] :)
[14:33:34] New patchset: Stefan.petrea; "Squashed commit of the following:" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/61005
[14:35:15] Change merged: Stefan.petrea; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/61005
[14:42:55] morning ori-l
[14:42:56] you there?
[14:43:00] got a vagrant q for you
[14:59:36] drdee: the report for #60 is running once more, it's completely automated, in cron now
[14:59:52] drdee: tests for New mobile pageviews are running in Jenkins
[15:00:06] drdee: we can do the Five Whys whenever you want
[15:02:03] drdee: I will continue work on #353
[15:02:34] * average is caffeine powered
[15:17:21] nice!
[15:45:55] New review: Milimetric; "os.path.join ftw!" [analytics/E3Analysis] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/60966
[15:45:55] Change merged: Milimetric; [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60966
[15:46:30] Otttoo who bist du?
[15:47:40] ottomata ^^
[15:48:03] yoyoooo
[15:48:06] who oder wo?
[15:48:37] New review: Erosen; "looks good." [analytics/E3Analysis] (master) - https://gerrit.wikimedia.org/r/60966
[15:56:43] ottomata,
[15:56:54] i think we need to add the mwdumper gerrit repo to the user_metrics api
[15:57:08] oh to run your script?
[15:57:09] and i can whip up a script to create the jar, download the dump file and run the import
[15:57:09] as a submodule?
[15:57:10] sounds good?
[15:57:22] i was thinking as a submodule
[15:57:26] other thoughts?
[15:57:30] makes sense, you should ask erosen and milimetric though
[15:57:39] also, you probably only want to clone it if you want to run that script, right?
[15:58:04] well, actually, that's fine
[15:58:07] yeah add it as a submodule
[15:58:08] submodule ftw!
[15:58:16] FTW!!!
[15:58:18] people won't get it unless they manually run git submodule init + update
[15:58:24] yeah
[15:58:38] ohhh but that also means java needs to be installed L(
[15:58:39] :(
[15:58:40] an alternative would be to add it to the vagrant instance with puppet or something, right?
[15:59:18] i guess so, but it would be nice if it was possible for someone to set this up without vagrant too
[15:59:21] i dunno
[16:00:03] yea, but wouldn't a puppetization do this?
[16:00:30] basically if there is no simple install command, we have to make one, right?
[16:01:32] i was thinking that all should be part of the puppet user_metrics manifest
[16:01:33] okay
[16:01:36] relocating to office
[16:01:37] brb
[16:06:09] hey erosen, cool python question
[16:06:14] sup?
[16:06:20] i'm trying to deduplicate that list of users
[16:06:31] the optimal way I can think of isn't very pythonic
[16:06:34] hehe
[16:06:37] but all the pythonic ways I keep finding aren't optimal
[16:06:55] hmm
[16:06:58] so it doesn't matter too much but just wondering if you had some cool way of doing it
[16:07:04] what is your current approach?
[16:07:14] cat user-list1 user-list2 | sort | uniq
[16:07:15] done
[16:07:20] :D
[16:07:24] like deduplicate(arbitrary_list_of_objects, lambda x: x['key_to_deduplicate']) or something
[16:07:26] hehe
[16:07:56] so my current approach is: iterate through all the objects putting them in a dictionary if they're not already there, then return a list of all the values in the dictionary
[16:08:17] that seems pretty reasonable
[16:08:27] the other options that come to mind are:
[16:09:06] [dict(items) for items in set(tuple(d.items()) for d in ds)]
[16:09:19] or:
[16:09:55] but the dict / set approach doesn't have a way of specifying what key to make the set/dict on right?
[16:09:59] that's what i looked at first
[16:10:02] [dict(group[0]) for group in itertools.groupby(ds, key=dict.items)]
[16:10:16] milimetric: yup
[16:10:20] or use a bloom filter
[16:10:22] so it's faster
[16:10:30] the set approach doesn't let you have different keys
[16:10:34] yeah, the itertools groupby requires it to be sorted on the key so it's much slower than the approach i was gonna take
[16:10:41] yeah
[16:11:00] hm, what do you mean different keys? I couldn't find a way to specify keys at all with that
[16:11:16] and what's a bloom filter, average?
[16:11:30] milimetric: it's a probabilistic data structure
[16:11:33] it's a super fancy hashing solution
[16:11:38] oh ok
[16:11:48] it's really fast for large stuff
[16:11:48] superfancy might be overkill :)
[16:11:59] faster than O(n)?
[16:12:16] milimetric: i think I was agreeing with you, that you can't specify an arbitrary key, with the set approach
[16:12:25] oh ok, misunderstood
[16:12:26] thanks guys
[16:12:42] seems like someone would've written this in the standard lib somewhere! :)
[16:12:55] c# man - i'm gonna make you all mono developers someday :)
[16:14:16] hehe
[16:15:03] mornin
[16:17:06] milimetric: how about F# ?
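For reference, the dictionary-based approach described at 16:07:56 can be sketched roughly as follows. This is only an illustration, not the code from the channel: the name `deduplicate` comes from the pseudo-call at 16:07:24, while the parameter names and the sample user records are assumptions. It keeps the first object seen for each key and makes a single pass, so it should be O(n) as long as the keys are hashable.

```python
def deduplicate(items, keyfunc):
    """Return one object per key, keeping the first object seen for each key."""
    seen = {}
    for item in items:
        key = keyfunc(item)
        # Dictionary membership is a hash lookup, so the whole loop is one pass.
        if key not in seen:
            seen[key] = item
    # Dictionaries preserve insertion order on Python 3.7+.
    return list(seen.values())


# Hypothetical sample data, not taken from the log.
users = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]
unique_users = deduplicate(users, lambda u: u["id"])
```

The key function plays the role of the `lambda x: x['key_to_deduplicate']` argument mentioned above, which is what the plain `set()` idea cannot express.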
[16:17:11] :D
[16:17:36] milimetric: you've probably already seen this, but this seems relevant http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order
[16:17:38] yeah, the C# that I'd use is called Linq
[16:19:15] hm, average, this part of the filter clause makes me think that my approach is faster: "if x not in seen"
[16:19:23] like, that makes it O(n^2)
[16:19:30] whereas mine's O(n)
[16:20:21] milimetric: you're trying to unique a list?
[16:20:32] dschoon: deduplicate 2 lists I think
[16:20:33] yeah, i figured there'd be some cool way to do it
[16:20:38] but i'm just looping over stuff :)
[16:20:41] no just one list
[16:20:46] it's of complex objects
[16:20:47] and order matters?
[16:20:50] nope
[16:20:55] of the result?
[16:20:56] it doesn't?
[16:21:02] then the easiest answer is:
[16:21:10] list(set(L))
[16:21:13] nope
[16:21:14] :)
[16:21:18] they're complex objects
[16:21:20] so here's my solution
[16:21:35] so you're saying they're not hashable?
[16:21:45] but they *do* have equality?
[16:22:08] I'd attach the problem with a PyBloom
[16:22:13] *attack
[16:22:16] lol
[16:22:20] https://gist.github.com/milimetric/5468518
[16:22:24] dschoon: :P
[16:22:43] milimetric: are you sure O(n^2) ?
[16:23:01] oh, you're right - it's just a set check
[16:23:09] well... no
[16:23:16] milimetric: can you cache that function call ?
[16:23:25] yeah, that set check would be O(n)
[16:23:28] milimetric: you have a function call that you can cache (or .. memoize)
[16:23:34] :P
[16:23:37] mine is still better:
[16:23:46] your keyfunc makes it easy
[16:24:01] i don't like mine, just can't find how to use my keyfunc with other stuff
[16:27:59] actually, this is for UM API, right?
[16:28:11] you should just implement __hash__ on the objects
[16:28:29] then you can use set()
[16:29:01] ah yeah, definitely
[16:29:01] http://docs.python.org/2/reference/datamodel.html#object.__hash__
[16:29:10] but we wanted to wait on that
[16:29:16] until we moved it to some sort of ORM
[16:29:26] *shrug*
[16:29:30] 'cause making some things objects and other things random tuples would be confusing
[16:29:36] sure.
[16:29:36] but yeah, i agree - that's the best solution
[16:32:41] can I ask what the objects look like ?
[16:32:54] and on what attributes they are compared ?
[16:34:22] https://gist.github.com/dsc/5468593 ^^ milimetric
[16:35:09] the yield(o) is nice
[16:35:20] but isn't "k in keys" inefficient?
[16:35:22] using an iterator there means you have to type the resulting collection
[16:35:27] i use a set.
[16:35:32] it's O(1) to check
[16:35:38] oh ok, didn't know that
[16:35:41] it's a set!
[16:35:45] that's the whole point :)
[16:35:51] yeah, i figured but didn't wanna assume
[16:36:07] so when you use it you have to go:
[16:36:18] uL = list(uniqued(L))
[16:36:24] yep
[16:36:41] but if you're already in a loop, it's faster
[16:36:52] for o in uniqued(L):
[16:37:54] milimetric: you might find some of this useful. https://github.com/dsc/py-lessly/blob/master/lessly/collect/tools.py
[16:40:08] https://github.com/dsc/py-lessly/blob/master/lessly/misc.py
[16:40:55] hm, cool
[16:41:25] that's my random collection of utilities and such that often find their way into projects
[16:41:32] especially invoke() and pluck()
[16:58:32] ottomata: hey
[16:58:44] yoyo
[16:58:44] what's up?
[16:58:50] i solved my thing but,
[16:58:58] fqdn is not available via facter apparently
[16:59:11] i tried setting it manually in Vagrantfile, but that didn't work
[16:59:16] i ended up having to set facter.domain
[16:59:27] that made fqdn == hostname.domain
[17:00:06] facter? i hardly.. ahem.
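A generator along the lines discussed around 16:35 might look like the sketch below. The gist itself is not reproduced in this log, so the body here is an assumption built only from what was said in the channel: a set of already-seen keys, an O(1) membership check, and a `yield` for each new object. The `keyfunc` parameter and its identity default are hypothetical.

```python
def uniqued(iterable, keyfunc=lambda o: o):
    """Lazily yield each object whose key has not been seen before."""
    keys = set()
    for o in iterable:
        k = keyfunc(o)
        # Set membership is an average O(1) hash lookup, not a linear scan.
        if k in keys:
            continue
        keys.add(k)
        yield o


# Hypothetical sample data, not taken from the log.
L = [{"id": 1}, {"id": 2}, {"id": 1}]

# Materialise the result when a list is needed ...
uL = list(uniqued(L, keyfunc=lambda o: o["id"]))

# ... or consume it directly when already looping.
ids = [o["id"] for o in uniqued(L, keyfunc=lambda o: o["id"])]
```

Because it is a generator, nothing is copied until the caller asks for it, which is why the result has to be wrapped in `list(...)` when a concrete collection is wanted.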
[17:00:14] sure, that works
[17:01:41] well, sometimes i need fqdn for puppet templates, ja know?
[17:03:19] yeah, i've noticed that in addition to built-in functions there's a set that puppetlabs unhelpfully calls "stdlib"
[17:03:36] unhelpfully because all they appear to me by that is "it's a library we use often"
[17:03:38] average: scrum?
[17:03:44] * to mean
[17:04:01] there is a stdlib module i think, or something, no?
[17:04:18] ja, and we have it in ops/puppet
[17:04:26] right, but it's not really built-in
[17:04:41] it's standard as in "we standardly use this", except not everyone does
[17:04:51] and it exists in parallel with a set of functions that *are* built in
[17:05:04] https://github.com/puppetlabs/puppetlabs-stdlib
[17:05:33] aye
[17:06:06] average on your way?
[17:07:56] dschoon: {keyfunc(o): o for o in it}.itervalues()
[17:08:15] !
[17:08:18] beautiful
[17:08:23] i totally had not thought of that
[17:08:29] very very nice, mr ori-l
[17:08:35] :)
[17:17:18] ori-l has a one-liner for everything :)
[17:53:15] milimetric: http://etherpad.wikimedia.org/analytics163
[17:53:32] Check that out for the agenda I'm putting together. Happy to chat now for the next 5 minutes if you want.
[17:54:06] sure kraigparkinson, i'll come to hangout
[18:04:49] New review: Yuvipanda; "Also why the empty files under datasources, datafiles? If you just want the folders, you can test fo..." [analytics/limn-mobile-data] (master) C: -1; - https://gerrit.wikimedia.org/r/60608
[18:08:31] ok erosen!
[18:08:33] what's doin?
[18:08:54] in a meeting ....
[18:08:55] one sec
[18:08:59] or rather 1h?
[18:09:00] k
[18:09:03] hmm, ja sure
[18:49:25] New patchset: Ram; "Some cleanup:" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/61049
[18:51:55] hey jgonera
[18:52:00] hey
[18:52:09] milimetric: ori-l: jgonera was wondering about getting limnpy running puppetized on our servers
[18:52:17] and he says that he's made debs before...
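The one-liner posted at 17:07:56 uses `itervalues()`, which exists only on Python 2 dictionaries. A hedged Python 3 equivalent is sketched below; the wrapper name `unique_by_key` and the sample data are assumptions added for illustration. One behavioural difference from the loop-based approaches is worth noting: a later object overwrites an earlier one that has the same key, so the last duplicate wins.

```python
def unique_by_key(it, keyfunc):
    """Deduplicate with a dict comprehension; the last object per key is kept."""
    # On Python 3, .values() replaces the Python 2-only .itervalues().
    return list({keyfunc(o): o for o in it}.values())


# Hypothetical sample data, not taken from the log.
users = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]
last_wins = unique_by_key(users, lambda u: u["id"])
```

If the first occurrence should be kept instead, iterating over `reversed(users)` or using an explicit loop would be needed.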
[18:52:28] don't get too excited, one deb, and a long time ago ;)
[18:52:31] oh, yeah, that'd be cool
[18:52:41] i don't think we need debs for it
[18:52:50] but you'd probably have to puppetize it yourselves
[18:53:07] hm
[18:53:12] milimetric: the deb we had was too old
[18:53:18] i know next to nothing about puppet
[18:54:10] i was talking to erosen about it. limnpy used features of pandas that were not present in the version available in the repo
[18:54:13] which was a problem
[18:54:33] so what's pandas, and where's its source code?
[18:55:01] and what version does limnpy depend on
[18:55:10] drdee: I just pushed a cleanup change to udp-filters.
[18:55:24] ok, have it, 0.9
[18:55:40] jgonera: there's a gerrit change for it somewhere
[18:55:41] let me look
[18:56:00] xyzram: got it!
[18:56:03] jgonera: https://gerrit.wikimedia.org/r/#/c/54116/
[18:56:46] raring has pandas 0.10
[18:57:49] xyzram: so the anonymization and geocoding have moved to different files?
[18:57:58] ottomata: i fixed the packetloss job
[18:58:08] jgonera: but we're not on raring
[18:58:21] YES!
[18:58:22] yeah?
[18:58:23] awesome!
[18:58:34] drdee: I made no change to anonymization and geocoding
[18:58:46] ottomata: i also fixed the timestamps in it to please limn, and then set it to incrementally backfill from jan 01
[18:58:55] ottomata: http://localhost:8888/filebrowser/view/wmf/public/webrequest/loss/webrequest_loss.tsv?offset=246594&length=4096&compression=none&mode=text
[18:59:09] ottomata: http://localhost:8888/filebrowser/view/wmf/data/webrequest/loss
[18:59:10] we only have tabs since feb 01
[18:59:11] what version are we on, latest LTS?
[18:59:29] i made the paths conform to hive partitions if you want to set one up
[18:59:49] xyzram: the diff deletes a lot of code; see https://gerrit.wikimedia.org/r/#/c/61049/1
[18:59:54] jgonera: I think we're on precise
[19:00:15] drdee: Two functions were commented out in udp-filter.c; I just removed those functions.
[19:00:17] drdee, i think I moved the non-tab data out into the archive dir
[19:00:19] i think anyway
[19:00:29] kool
[19:00:37] YuviPanda, makes sense, that's the last LTS
[19:00:47] drdee: So the code that was deleted was in fact commented out.
[19:00:58] ok
[19:01:15] YuviPanda, I'll put this on my not-urgent-todo ;)
[19:01:20] jgonera: :)
[19:01:52] drdee: It should also be slightly faster since it removes some unnecessary re-initialization in the core loop.
[19:01:53] YuviPanda, do you know what exactly would make ops happy? Just a single deb for precise or a PPA?
[19:02:09] jgonera: I'm not really sure :(
[19:02:18] jgonera: I think LeslieCarr would be the best person to ask
[19:02:18] drdee: I'll now be looking at improving performance.
[19:02:21] xyzram: great! so i would say let's start creating some benchmarks on locke, do you agree?
[19:02:24] YuviPanda, ok, I'll ask around when/if I get to it
[19:02:30] jgonera: :)
[19:04:18] drdee: Yes, definitely. I have my own mini-benchmark of about 800MB (1.8M lines) captured from locke that I use for checking my changes.
[19:08:11] brb lunch
[19:13:02] ottomata, brain bounce
[19:13:05] ?
[19:13:16] yoyo sure, one sec
[19:15:51] ok standup?
[19:20:13] drdee ^
[19:21:00] shour
[19:47:35] getting laundry, back in a bit
[19:48:49] back
[20:09:45] milimetric: i'm going to update your webrequest_mobile_platform job to fit with the other jobs
[20:09:50] any objections?
[20:10:05] no objections at all, if you could do it in one commit and send me the link that'd be appreciated
[20:10:29] also - make sure you don't break anything :)
[20:11:08] will do.
[20:11:22] is it fine if i change the output URLs?
[20:11:28] have they been given out to anyone?
[20:15:16] ^^ milimetric
[20:15:30] tomasz is reviewing them
[20:15:35] so yes
[20:15:54] there's only one output url though
[20:16:00] it compiles everything into a single tsv
[20:16:16] so if you email him that as is right now, and tell him you're working on it and don't want it to be unavailable
[20:16:18] i think that's fine
[20:17:27] New patchset: JGonera; "Generate JSON files in datasources using templates" [analytics/limn-mobile-data] (master) - https://gerrit.wikimedia.org/r/60608
[20:19:20] it won't go away
[20:19:29] i'll just email with a note when i update it
[20:24:51] one way to validate that you haven't changed anything, btw, is probably to just re-run it and compare with the old tsv
[20:48:41] milimetric: https://github.com/wikimedia/kraken/compare/2be0bbb1c3e1420b81434d10c8e50ad82d0cd218...HEAD
[20:49:31] huh, cool
[20:49:31] https://github.com/wikimedia/kraken/pulse/monthly
[20:49:41] ottomata, drdee, milimetric ^^ that's new
[20:49:51] in meeting
[20:49:56] cool
[20:50:34] https://github.com/wikimedia/limn/pulse/monthly
[20:59:51] ooh fancy
[21:00:24] milimetric: https://github.com/wikimedia/kraken/compare/2be0bbb1c3e1420b81434d10c8e50ad82d0cd218...c5cbbafff04e83c9b98994be4c8e255b9d8afbe2
[21:00:27] that's a better diff now
[21:00:36] as moving the files causes diff to be silly
[21:02:12] ottomata: /etc/udp2log/emery runs udp-filter with: pipe 10 /usr/bin/udp-filter -F '\t' -p _NARA_ -g -m -b country >> ....
[21:02:32] but -m in the code needs a path argument.
[21:02:53] Is a different version of the code deployed there ?
[21:02:56] bwaa
[21:02:59] hm
[21:03:52] weird, i think it is a mistake
[21:04:08] i guess the code probably checks if the provided geoip path is null and then uses default?
[21:04:18] ah nope
[21:04:19] totally broken
[21:04:20] Pipe terminated, suspending output: /usr/bin/udp-filter -F '\t' -p _NARA_ -g -m -b country >> /a/log/webrequest/glam_nara.tsv.log
[21:04:22] fixing..
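One plausible reading of the broken `-g -m -b country` line, assuming udp-filter parses its options getopt-style (the log does not show the parser), is that `-m` swallows the next token as its path. The sketch below uses Python's standard `getopt` module purely as an illustration of that general behaviour; the option string `gm:b:` is hypothetical and is not taken from udp-filter's source.

```python
import getopt


def parse(argv):
    """Parse argv with a hypothetical option spec.

    "g" is a plain flag; "m:" and "b:" each require an argument.
    This spec is an assumption for illustration only.
    """
    return getopt.getopt(argv, "gm:b:")


# The same tail of options as on the emery line quoted above.
opts, rest = parse(["-g", "-m", "-b", "country"])

print(opts)  # [('-g', ''), ('-m', '-b')]
print(rest)  # ['country']
```

Under this assumption `-m` takes `-b` as its database path, `-b` is never seen as an option, and `country` is left over as a stray positional argument. Whether that is what actually terminated the pipe on emery is not established by the log.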
[21:06:23] also, of the 3 production machines this is the only line using geocoding (-g and -b); is geocoding not a priority ?
[21:08:04] drdee: ping
[21:08:14] erosen: pong
[21:08:22] datahub meeting
[21:08:39] coming
[21:08:42] cool
[21:08:50] i think so, but, if we anonymize things, then we have to geocode first
[21:09:05] we're doing some geocoding in kraken right now
[21:09:09] but if we anon the IPs then we can't do that anymore
[21:09:58] kraken is further downstream from the 3 logging machines ?
[21:12:49] All the filters on oxygen use IP filtering only (-i)
[21:44:45] New patchset: JGonera; "Add a 30-day moving average of the rendering time graph" [analytics/limn-mobile-data] (master) - https://gerrit.wikimedia.org/r/60614
[21:47:54] jgonera: the moving average feature is actually built in
[21:48:00] it's called "smoothing"
[21:49:28] ottomata, seed sql file tarred gzipped 788k.
[21:49:46] happy?
[21:51:22] das good!
[21:51:27] what is it uncompressed?
[21:51:31] 2.3mb
[21:51:35] hmmmmmmmmMMMMmmmm
[21:51:43] still pretty big but I guess
[21:51:43] please please please please
[21:51:44] i mean
[21:51:54] i already did a lot of extra work
[21:51:56] it might be better to do it uncompressed, because then git can do its fancy stuff, and we can see diffs, right?
[21:52:05] yahhhhhhh
[21:52:15] q: does git know how to do its fancy stuff with binary files?
[21:52:30] i guess it'd be hard for it to do with a zip file anyway, cause changes to it would cause it to zip differently
[21:53:00] i think you need to ask milimetric if he wants a 2.3MB file committed to E3Analysis repo
[21:53:07] if he's cool, i'm cool
[21:55:48] ottomata: 979K Apr 26 21:55 seed.sql
[21:55:55]