[12:58:10] yo average_drifter
[13:14:00] average_drifter….
[13:25:27] hey drdee
[13:25:32] morning
[13:25:37] or afternoon
[13:25:51] afternoon , thanks :)
[13:26:36] ready to finish the webstatscollector?
[13:27:24] yes
[13:27:36] this is episode 2, I think we will finish today
[13:27:55] yes, it is always more painful than it seems
[13:28:16] can you push your latest fixes to gerrit?
[13:28:21] (new patch set)
[13:28:45] right now don't have any, but working on those commandline switches
[13:28:51] reading the page you gave me also
[13:29:20] i like it, you just tag a branch
[13:29:35] and then it generates nice version number
[13:29:42] i will add it to debianize.sh
[13:30:24] ok
[13:33:05] drdee: just read, that thing is very minimalistic and looks awesome
[13:33:15] yes :)
[13:33:18] I like minimalist stuff (using wmii as a window manager)
[13:33:37] me too, i just need to add a tag to webstatscollector
[13:33:38] that's all
[13:36:04] should I git pull to get the latest debianize.sh ?
[13:46:16] haven't pushed yet
[14:03:30] hey milimetric
[14:03:38] howdy :)
[14:40:01] average_drifter: https://gerrit.wikimedia.org/r/25502
[14:40:50] run git pull
[14:50:19] ran
[14:58:57] k
[14:59:31] average_drifter:
[14:59:34] i run
[14:59:35] git describe | awk -F'-g[0-9a-fA-F]+' '{print $1}' |
[14:59:49] and i want the dash '-' replaced with a dot '.'
[14:59:57] trying with sed, no luck so far
[15:00:09] like git describe | awk -F'-g[0-9a-fA-F]+' '{print $1}' | xargs sed -i -e 's/\-/./g'
[15:00:13] how to do this?
[15:00:48] milimetric: this is totally f*** awesome: http://www.michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does/
[15:01:30] (i am still waiting for your linear regression implementation in JS, btw) :D :D :D
[15:02:40] :)
[15:03:29] http://dracoblue.net/dev/linear-least-squares-in-javascript/159/
[15:03:52] copying > doing :)
[15:04:15] i can't read that article atm because I'm obsessed with d3
[15:04:25] sooooooooooooo cool :)
[15:04:58] drdee: perl
[15:05:06] drdee: I'm going to Perl that thing :)
[15:05:06] that's a naive implementation
[15:05:12] (OLS)
[15:05:19] i'd rather see a matrix-algebra one
[15:05:24] much faster :D
[15:05:25] drdee: Marquardt ?
[15:05:39] i was responding to milimetric
[15:06:18] oh sorry
[15:06:22] np
[15:06:32] hm, wonder if anyone duplicated any matlab libraries in JS :)
[15:08:05] close enough: http://www.jstat.org/
[15:08:54] you can write a small thing in C/C++ that uses industrial linear algebra libraries, and do the computations on server-side and just get the data back through AJAX..
[15:09:09] dunno if it's worth it
[15:10:00] well, it's nice to have stuff like that client side
[15:10:07] scalability becomes a non-issue
[15:10:31] nah, that jstat is garbage right now, no matrices I can see
[15:10:53] sorry for the distraction, i was just rambling
[15:11:21] drdee, I'd be more than happy to bust out some linear algebra up in here. You can definitely add that to the list of "can do" when you're pondering how we can be useful to wikimedia and the community at large
[15:11:41] in limn you mean?
[15:11:50] sure, or as a standalone lib.
[15:12:09] my dad and I are two people with an uncanny appreciation for jordanizing matrices :)
[15:12:46] mmmmmm…… i like the idea, particularly if we could come up with 95% confidence intervals for projecting measures a couple of months into the future :D :D
[15:13:23] and you jordanize matrices for breakfast with your father?
[15:13:39] i used to back in college
[15:13:49] most people bond over fishing
[15:14:24] standalone lib is probably way to go
[15:14:27] anyways
[15:17:31] average_drifter: how far are you with final changes?
[15:29:45] drdee: if you lose the xargs in the sed it works fine
[15:38:59] drdee: mv webstatscollector_${MAIN_VERSION}_amd64.deb webstatscollector_${VERSION}_amd64.db
[15:39:05] drdee: that's the last line of debianize.sh
[15:39:08] drdee: is the .db a typo ?
[15:41:29] drdee: around ?
[15:45:24] yo
[15:45:33] yes that's a typo
[15:45:41] ok
[15:45:44] I'm going through it
[15:45:52] drdee: also, the architecture amd64 is hardcoded
[15:46:05] drdee: can I make it detect what arch is available on the system ?
[15:46:13] yes, please show me how to do that
[15:46:19] drdee: ok
[15:53:04] ersoen: check http://svn.mediawiki.org/viewvc/mediawiki/trunk/tools/wsor/
[15:53:13] that also contains a whole bunch of editor-focused scripts
[16:17:25] who is working on http://stats.wikimedia.org?
[16:21:49] Alchimista: Erik and me
[16:21:52] Alchimista: and drdee
[16:22:13] I think
[16:22:17] yes
[16:23:27] well, could you please include another output of the tables reports in json, or similar? Tables like this -> http://stats.wikimedia.org/EN/TablesWikipediaPT.htm
[16:23:51] drdee: please discuss with drdee for a ticket on asana regarding this and it'll be done
[16:24:37] hey ottomata!
[16:24:43] survived the borked hdd
[16:24:45] ungh, i am online, from a 2 week old backup, minus one solid state drive
[16:24:58] Alchimista: there's no json in the link you posted
[16:25:00] computer is still taken apart at the moment
[16:25:12] Alchimista: can you please be more verbose :)
[16:25:32] or it's possible I haven't seen it
[16:26:11] making some lunch, then will put compy back together and be on in time for standup
[16:27:48] ok
[16:27:57] average_drifter: precisely.. well, the info on the page is very useful, but to re-use it, the output isn't the best one. I need some of the info, so i was starting to write a py script to read the generated html, but looking at the perl scripts, it seems simple to generate another output with the info in json or similar
[16:29:20] that way, to get the data, there would be no need to do Web scraping
[16:34:31] yeah, we could put it in the backlog but it isn't a high priority thing
[16:36:13] i can get the data by webscraping, until it's not available, with no major problem. there are no plans to change the html output, right?
[16:51:35] mornin kids
[16:52:24] morning
[16:52:52] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[16:55:39] Error 337 (net::ERR_SPDY_PROTOCOL_ERROR): Unknown error
[17:21:55] brb coffee
[17:21:58] brb lunch
[17:22:07] ah, timezones!
[17:29:24] going to try a couple more things on compy, and put it back together, be back in a bit
[17:31:02] average_drifter….ready
[17:31:04] ?
[17:34:22] drdee: I'm here, had a call with someone and replied to Joady
[17:34:30] k
[17:34:41] ready to push?
[17:34:59] not yet
[17:35:29] I also have a problem with lintian because I have perlbrew installed on my system
[17:36:26] perlbrew is the equivalent of ruby-build/rbenv or virtualenv
[17:36:57] so I have multiple Perl versions on my system (I have it because I do a lot of Perl dev and I need to switch between Perl versions and with/without debugging symbols)
[17:37:27] I'm going to fix this
[17:37:38] and afterwards I'll first git review the changes to debianize.sh
[17:38:40] man, HDD is SO much slower than SSD!
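A minimal sketch of the two debianize.sh fixes discussed above (the dash-to-dot version string and the .db typo), assuming the version comes straight from git describe; exact variable names in the real script may differ:

    # Replace dashes with dots in the git-describe version. Piping into sed
    # (instead of "xargs sed -i", which would treat the version string as a
    # filename to edit) is the "lose the xargs" fix mentioned above.
    VERSION=$(git describe | awk -F'-g[0-9a-fA-F]+' '{print $1}' | sed -e 's/-/./g')

    # The last line of debianize.sh with the .db typo corrected to .deb:
    mv "webstatscollector_${MAIN_VERSION}_amd64.deb" "webstatscollector_${VERSION}_amd64.deb"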
[17:38:54] I haven't worked from an HDD in like 2.5 years or something
[17:47:08] average_drifter: don't worry about lintian
[17:47:21] drdee, -Xms doesn't quite work for me...
[17:47:25] Invalid initial heap size: -Xms=512m
[17:47:32] seems to not matter what size I put
[17:49:52] try without the '='
[17:49:58] like java -Xms128m -Xmx128m
[17:52:22] sorry for the two incorrect emails, :(
[17:53:24] doh, yup
[17:53:37] it's at least running, we'll see if it fixes anything...
[17:54:17] naw, same deal
[17:54:41] i didn't think it would solve the problem,
[17:54:45] back
[17:54:48] it's just better practise
[17:54:48] aye yeah
[17:54:50] ok
[17:54:57] dschoon, can you help me figure out why I can't run this pig script?
[17:55:00] back
[17:55:03] sure
[17:55:06] what's up?
[17:55:15] on an04
[17:55:17] cat /home/otto/pig_1348768391053.log
[17:55:28] aiight
[17:56:48] yarr
[17:56:51] all the hostkeys have changed
[17:57:32] haha, yup
[17:58:29] dschoon:
[17:58:29] https://gist.github.com/3795405
[17:58:36] well
[17:58:40] heapspace error
[17:58:53] since these machines have a huge amount of ram
[17:59:01] that means your script is doing something truly excessive
[17:59:03] i guess?
[17:59:14] i guess, mx and ms are set at 10G each
[17:59:23] but i mean, shouldn't pig be smart about this?
[17:59:31] it does work with smaller bits of data, so i guess not
[17:59:49] `ssh-keygen -R $host` is what i usually use
[17:59:58] aye, but mine matches anything
[17:59:59] so I do
[18:00:06] ssh-clear-host analytics
[18:00:06] but pig should be smart about this
[18:00:08] and it removed all of them
[18:00:10] ah
[18:00:11] yes.
[18:00:13] probably it's an issue with the script
[18:00:16] makes sense
[18:00:18] probably
[18:00:19] https://github.com/wmf-analytics/kraken/blob/master/src/pig/geocode_and_group_by_country.pig
[18:00:26] i am a newbie pigger
[18:01:01] as am i.
[18:01:15] do we have the source code for 'akela-0.5-SNAPSHOT.jar'
[18:01:45] it should be in the akela repo
[18:01:50] /home/otto/akela
[18:01:51] or
[18:01:52] yeah
[18:01:54] on github
[18:02:04] https://github.com/mozilla-metrics/akela
[18:02:18] https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/eval/geoip/GeoIpLookup.java
[18:02:46] drdee: please review https://gerrit.wikimedia.org/r/25522
[18:02:52] k
[18:03:45] ottomata: in what cases *does* it work?
[18:03:50] is the UDF emitting the key value pairs quickly enough?
[18:04:05] if it holds on to them too long it will obviously run out of memory
[18:04:29] what does PARALLEL 28 mean?
[18:04:46] because if it means what i think it does, you're trying to process 100% of the data on one machine earlier in the script
[18:05:18] it is the number of reducers, and is a guess, hang on...
[18:05:28] no, probably not
[18:05:40] i'd hope not :)
[18:07:54] average_drifter: merged https://gerrit.wikimedia.org/r/#/c/25522/
[18:08:18] http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause
[18:08:34] http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause
[18:08:42] yeah so put it back to 1
[18:08:43] drdee: you tried it right ?
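For the heap-size error above: the JVM memory flags take the size appended directly to the flag, with no '='. A minimal sketch; the jar name is only a placeholder, not anything from the log:

    # Wrong:  java -Xms=512m ...      -> "Invalid initial heap size: -Xms=512m"
    # Right:  append the size directly to -Xms / -Xmx
    java -Xms512m -Xmx512m -jar some-job.jar   # some-job.jar is hypothetical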
[18:08:55] average_drifter: no, i just read the code
[18:09:00] but it looks good :)
[18:09:19] drdee: :)
[18:09:56] it's not really even getting to the reduce phase
[18:12:52] dschoon, for a good test run:
[18:14:18] cd /home/otto
[18:14:20] dse pig -p input=/user/otto/test0/sampled10000.log -p output=/tmp/country10000.0 -f ./geocode_and_group_by_country.pig
[18:14:36] dse hadoop fs -cat /tmp/country10000.0/part*
[18:15:13] hm.
[18:15:21] and will this succeed or fail?
[18:15:33] succeed
[18:15:39] that's on a 10000 line file
[18:15:39] so not much data
[18:15:58] btw, I just ran that, so change your output dir
[18:16:00] if you run it
[18:29:46] average_drifter: elif [ $ARCHITECTURE != "x86_64" ]; the != ===> ==
[18:30:01] i'll fix it
[18:33:03] alright
[18:33:12] average_drifter: https://gerrit.wikimedia.org/r/25528
[18:34:18] thanks
[18:36:37] ottomata: can you try "set pig.exec.nocombiner true;" and then run the job again?
[18:38:37] where does one set that?
[18:38:45] in pig latin
[18:38:49] top of script
[18:38:52] ah.
[18:38:54] i will try it
[18:39:27] and you can also disable the sort phase
[18:39:32] in the script itself
[18:39:51] that's also probably quite memory intensive
[18:40:01] at the end?
[18:40:04] yes
[18:40:04] my sort statement?
[18:40:06] yes
[18:40:16] hmm, i dunno, really? there won't be many keys at that point
[18:40:21] no more than the number of countries in the world
[18:40:26] ohhh right
[18:40:28] so what, 200-300?
[18:40:32] nvm
[18:40:43] i was thinking of a different key
[18:40:52] you should probably also connect via jmx to one of the machines
[18:40:54] and watch it run
[18:41:01] watch the heap panel
[18:41:05] hmmk
[18:47:11] i need to read the pig docs.
[18:47:24] i haven't done that in a long time, and things have changed quite a bit.
[18:48:11] drdee it is running!
[18:48:16] at least much longer than before
[18:49:24] nice
[18:50:18] but it might come with a performance penalty
[18:52:21] and are you running this on top of dse? cause that might also introduce some unknowns
[18:53:28] hm. that's a good point.
[18:54:33] i'd be curious
[18:54:49] maybe we should try running a pig script that reads in those input files you selected before
[18:54:57] and just sorts them
[18:55:04] see if it can do that without OOMing
[18:55:11] that would isolate whether it's related to maxmind
[18:55:59] dse yea
[18:56:02] " It is important to remember that the in-memory footprint of deserialized input might significantly vary from the on-disk footprint; for example, certain class of Pig applications result in 3x-4x blow up of on-disk data in-memory. "
[18:56:04] from
[18:56:07] http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
[18:56:22] not sure if that applies here, but pig is memory hungry
[18:56:25] but it should be reading line-by-line?
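A sketch of the architecture detection asked for earlier, with the == comparison from drdee's review note above; it assumes uname -m feeds $ARCHITECTURE and the DEB_ARCH name is hypothetical, so the actual patch in gerrit change 25528 may look different:

    # Detect the build architecture instead of hardcoding amd64.
    # (dpkg --print-architecture would also work on Debian/Ubuntu.)
    ARCHITECTURE=$(uname -m)
    if [ "$ARCHITECTURE" == "x86_64" ]; then
        DEB_ARCH="amd64"
    elif [ "$ARCHITECTURE" == "i686" ] || [ "$ARCHITECTURE" == "i386" ]; then
        DEB_ARCH="i386"
    fi
    mv "webstatscollector_${MAIN_VERSION}_${DEB_ARCH}.deb" "webstatscollector_${VERSION}_${DEB_ARCH}.deb"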
[18:56:30] (and slower)
[18:57:30] but that was my first question, how often does pig emit the key value pairs, does it do it after every line or does it hold on to them for a while
[18:59:53] ottomata: cool: https://github.com/ooyala/miyamoto
[19:04:29] that's cool, basically syncing the manifests and running puppet locally
[19:04:35] yeah
[19:04:38] clever idea
[19:05:00] WOOT, it works on a whole month now
[19:05:04] i'm going to try forever
[19:06:48] http://analytics1003.eqiad.wmnet:50030/jobdetails.jsp?jobid=job_201209251637_0062&refresh=30
[19:07:05] woot
[19:07:16] we should keep track of env (num# tasks, reducers, vm memory variables) and running time
[19:07:46] google spreadsheet, 4sho
[19:07:53] (it's a good idea)
[19:08:14] basically every attempt is an experiment
[19:08:47] drdee: should we worry about filter being a common name ?
[19:08:55] i can add to this:
[19:08:55] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AvpRkIqSY9hNdEtRLVNoQWNvQzNleHBtTXR5emI3Z2c&pli=1#gid=0
[19:08:58] no, the name sucks
[19:09:09] but let's not worry about it
[19:09:14] ok
[19:09:31] ottomata: cool
[19:13:30] yeah so probably what caused the OOM was that hadoop's combiner was aggregating too much data before sending it to the reducers
[19:13:41] so by disabling the combiner that 'solved' the problem
[19:13:45] can't you set that?
[19:13:47] but it's still a hack
[19:13:50] the threshold?
[19:13:56] you would think so :)
[19:14:16] brb coffee time!!
[19:17:41] hmm, i would think it would be smarter than that
[19:18:05] hm, does that mean
[19:18:20] that each map task was reading in and working with too much data?
[19:18:29] i'm not sure.
[19:18:41] or hm, combiner? is that the piece that is taking data from the mappers and giving it to reducers?
[19:18:46] both pig and hive are complex enough that i feel i need to spend an evening with the docs
[19:18:51] yeah
[19:18:58] meeee tooo, i've just been hacking around thus far
[19:22:11] heh....that was silly. looks like Andrew already did the task I just entered
[19:22:41] what is analytics about if not prescience?
[19:23:02] and i already checked it off :)
[19:23:03] or take that back...it was dschoon a week or so ago
[19:23:11] what was it?
[19:23:15] er...no, misreading history
[19:23:18] ignore me :)
[19:23:27] updating status page
[19:23:35] ah, yeah.
[19:23:36] that was me.
[19:24:36] and me!~
[19:25:17] who should update https://www.mediawiki.org/wiki/Analytics/Reportcard ?
[19:25:31] probably dan/me/dieds
[19:26:22] want something in Asana for that?
[19:27:01] sure!
[19:28:10] dschoon: assigned to you for now. feel free to pass around
[19:28:16] coolio. thx
[19:28:26] is there a way to have a recurring task in Asana?
[19:28:54] not that i know of
[19:29:09] bummer....would be super handy for this
[19:29:14] yeah.
[19:29:22] i'll let you know if that changes :)
[19:38:38] you can tag an asana task as monthly or weekly
[19:38:48] it's not recurring but close
[19:39:07] ottomata: yes combiner gets data from mappers and sends it to the reducers
[19:39:49] hey ottomata, you noticed ACCESSING_NON_EXISTENT_FIELD is 218,108,757, right?
[19:44:17] drdee: partial review imminent
[19:47:01] drdee: you did some updates on debianize.sh while I was working, can I merge them into my branch now ?
[19:47:07] drdee: before the review I mean ?
[19:47:15] no just run git pull
[19:47:25] drdee: I ran pull on master
[19:47:34] drdee: should I merge the changes into my branch before I git review ?
[19:47:49] yes
[19:47:53] ok
[19:49:38] drdee: https://gerrit.wikimedia.org/r/25542
[19:49:43] please review
[19:52:53] hey fabian!
[19:54:25] average_drifter: merged, there is one tiny thing left, there should be a command line switch for filtering out page views by bots (user_agent_is_bot())
[19:54:43] drdee: yes sorry, I was just reading the backlog to see what I missed
[19:54:51] and maybe rename the 'test' param to 'debug'
[19:54:53] drdee: thanks
[19:54:55] but that's really all
[19:54:58] ok
[19:58:01] ottomata: woo! it completed!
[19:58:06] woot!
[19:58:07] yup
[19:58:12] 50 mins
[19:59:47] to geocode and count by country requests from 500GB of sampled logs
[20:00:10] with 7 nodes
[20:00:16] nice.
[20:00:20] that's totally reasonable.
[20:01:04] 500G represents a few hours of unsampled logs, iirc
[20:01:20] so that would be 1.4 Gb per node per minute
[20:01:35] a few hours?
[20:01:45] this is sampled 1000
[20:01:49] :D
[20:01:55] from nov 2011 to mid june 2012
[20:01:56] this is not a few hours
[20:02:01] i know.
[20:02:06] oh unsampled
[20:02:08] sorry missed that word
[20:02:17] but i thought unsampled we generated a file that was ~10G in half an hour
[20:02:26] git review is sometimes a pain but I really like the idea of a review tool. I like the way you guys work. Are you doing this for a long time now ?
[20:02:27] give or take a single-digit scalar
[20:02:32] yes
[20:02:57] which means 500G is maybe 10-25 hours of unsampled sata
[20:02:58] so this would be a few minutes of unsampled
[20:02:59] *data
[20:03:25] 20G/hr = 25h
[20:03:30] not minutes :P
[20:04:06] i think we guessed something like 300G / day
[20:17:24] drdee
[20:17:27] Scrub javascript from user-agent string to prevent injection attack
[20:17:29] ?
[20:21:54] ottomata, yes….. it is quite likely that people are trying this or will try this and such user_agents will be unique and we could expose them back in a html frontend that would trigger them
[20:23:25] ahhh back in an html frontend
[20:23:51] heh
[20:23:55] i mean
[20:23:56] aren't we going to use that lib to map agents to a db of devices/browsers anyway?
[20:24:08] we just need to make sure that we escape things before presenting them :P
[20:24:11] also, i think the frontend should probably do that
[20:24:12] yeah
[20:24:17] html tags, more or less.
[20:24:19] i don't think kraken should worry about sanitizing what it is given
[20:24:23] i don't think we should bother sanitizing
[20:24:26] ^^ agree with ottomata
[20:24:38] also, ha, where do you come up with this stuff!? haha
[20:25:01] i'm picturing you there just sitting, working, and then all of the sudden "AH! I will give otto a todo to sanitize user agents!"
[20:25:16] i didn't assign it to you right,
[20:25:26] hehe
[20:25:31] and in this case i was talking with stefan
[20:25:36] i think you are missing the part where he cackles, ottomata
[20:26:09] oh you are right, just added cool
[20:26:11] and we were talking about other vulnerabilities as well, so i just put it very low in asana, nothing urgent
[20:26:13] i just got the email and assumed
[20:26:32] mostly as reminder to myself
[20:34:54] i think this is a clear case of me being victorious :D
[20:35:34] ottomata, dschoon ^^
[20:35:38] oh?
[20:35:47] i think that is true :)
[20:36:06] "ottomata: also, ha, where do you come up with this stuff!? haha
[20:36:06] [4:25pm] ottomata: i'm picturing you there just sitting, working, and then all of the sudden "AH! I will give otto a todo to sanitize user agents!""
[20:36:36] victorious?
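The branch-update step average_drifter asked about just above (merge drdee's debianize.sh updates into the review branch before running git review) boils down to something like the following sketch; the branch name is hypothetical:

    # Bring the review branch up to date with master before sending it to gerrit.
    git checkout debianize-fixes    # hypothetical branch name
    git merge master
    git review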
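For reference, the throughput figures quoted above work out as a rough back-of-the-envelope check, not a measured benchmark:

    500 GB / 7 nodes / 50 min ≈ 1.4 GB per node per minute
    at ~20 GB/hour of unsampled logs, 500 GB ≈ 500 / 20 = 25 hours of raw traffic
    at the guessed ~300 GB/day, 500 GB is a bit under two days of unsampled data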
[20:36:47] but also props to average_drifter
[20:37:20] thanks :)
[20:39:24] so far i found 8
[20:42:02] brb snack
[20:57:29] drdee: currently if a bot is detected two messages are printed by filter
[20:57:52] drdee: one that goes through the "is_bot" branch, and one below it
[20:58:10] drdee: should we have just one line printed by filter for each line of input it receives ?
[20:58:23] yes
[21:03:45] drdee: https://gerrit.wikimedia.org/r/25585
[21:05:59] merged
[21:06:04] ok, let's go to labs!
[21:06:11] great
[21:06:22] I'm on labs
[22:30:51] I'll be back ©
[22:43:54] dschoon: i think you'll like this http://th.informatik.uni-mannheim.de/People/Lucks/reject.pdf
[22:44:04] especially the turing and shannon rejections
[22:48:01] looking :)