[00:15:08] is /a on stat1 shared among many machines ? [00:15:09] dschoon, milimetric: quick limn q [00:15:09] spetrea@stat1:/a/wikistats_git$ mount | grep "/a" [00:15:09] /dev/mapper/stat1-a on /a type ext4 (rw) [00:15:34] how do i make it so that the graph doesn't display certain data points at all? [00:18:02] for the record [00:18:05] i feel like death [00:18:11] erosen: you leave them undefined [00:18:21] great [00:18:26] 2013/01/12,10,,3 [00:18:31] thanks [00:18:33] sorry to hear that [00:18:37] that third entry is undefined [00:18:41] I was surprised to see you hanging on [00:18:49] trust me [00:18:54] i am as displeased as anyone [00:18:58] hehe [00:28:20] hey erosen [00:28:24] hey [00:28:28] was helping drdee for a while [00:28:33] np [00:28:38] so this graph has missing data: [00:28:55] i actually checked this one: http://dev-reportcard.wmflabs.org/ [00:29:03] and it looks empty indeed [00:29:06] http://dev-reportcard.wmflabs.org/graphs/active_editors_target [00:29:16] http://dev-reportcard.wmflabs.org/data/datafiles/rc/rc_active_editors_target.csv [00:30:57] word [00:30:58] thanks [00:31:04] does that help? [00:31:18] yup [00:31:29] i just put consecutive commas: ,, [00:31:38] it's like metric-list;metric-list;metric-list [00:31:45] where metric-list is value,value,value [00:31:49] and you can have trailing , [00:31:51] or trailing ; [00:31:55] and it doesn't seem to care much :) [00:32:23] oh interesting... i wonder if that's not displaying right though [00:32:49] oh sorry this is such weird format [00:33:00] np [00:33:20] i'm actually a bit confused by the semicolons [00:33:35] is that your way of symbolically denoting columns? 
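A minimal sketch of the datafile row convention described above: an empty field between commas is an undefined point (the graph skips it), semicolons separate metric-lists, and trailing `,` or `;` are tolerated. The parser itself is hypothetical, not Limn's actual code:

```python
def parse_row(row: str):
    """Parse one datafile row like '2013/01/12,10,,3'.

    An empty value becomes None, which is how a graph point is left
    undefined; trailing ',' and ';' are ignored, as the chat suggests
    Limn doesn't care much about them.
    """
    date, _, rest = row.partition(",")
    metric_lists = []
    for group in rest.rstrip(";").split(";"):
        metric_lists.append([float(v) if v else None
                             for v in group.rstrip(",").split(",")])
    return date, metric_lists

print(parse_row("2013/01/12,10,,3"))
# ('2013/01/12', [[10.0, None, 3.0]])
```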
[00:36:09] i think this changed [00:36:15] and somehow is magically still handled by new limn [00:36:45] just use this instead (one sec) [00:37:47] oh damn gerrit, i'm gonna email you this file [00:37:58] hehe [00:40:17] ok this file format is quite crazy [00:40:36] i sent you the file we have in the reportcard-data repository which I think is a bit more sane [00:40:56] I'll keep in mind to revisit this [00:47:24] milimetric: thanks I got the file [01:33:33] damn I hope I have no syntax errors [01:33:48] erm, or other stuff [01:33:58] reporting reached 18-December-2012 [01:39:02] so it finished? [01:39:08] maybe not all files are on stat1? [01:46:36] all the files are on stat1 [01:47:00] didn't finish yet, it's at 21-December. It has to reach 1st January 2013 to finish [01:51:30] ok [01:53:53] drums please! [01:55:46] :D [01:59:53] I also have a UDP joke [02:02:10] shoot [02:02:44] I wanted to tell a UDP joke but I was afraid you wouldn't get it [02:03:02] ROFLOL [02:03:05] :D :D :D [02:03:08] :) [02:36:53] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r10/pageviews.html [02:37:32] after 25h12m [02:38:04] nice! [02:38:08] so this does not have: [02:38:13] * bot filtering [02:38:27] * deduplication of API requests [02:39:00] we are talking with tomasz on wednesday; if we can have bot filtering ready by then, that would be really awesome [02:39:22] I think deduplication will be hard [02:39:31] yes it will be :) [02:39:37] about the discarded line [02:39:44] yes, those numbers are huge [02:39:46] that's not in millions, right? [02:40:27] so the sigma column contains actually processed lines, and the discarded column has discarded lines (because of the time or url column being incorrect or corrupted) [02:41:42] it looks like it's in millions [02:44:20] well actually the discarded line should not be multiplied by 1000 [02:45:33] oh, why?
[02:46:46] because ideally we would divide the number of discarded lines by the total number of lines in a day and display a percentage [02:47:07] absolute numbers don't say as much [02:48:55] there are some problems with the discarded lines. sometimes a line is discarded because the time-field was invalid (there are such cases) [02:49:14] uhm, but yeah I can associate it to a month [02:49:19] what's wrong with the time-field? [02:49:35] i would not discard those lines [02:49:39] so where is the bot filtering in wikistats right now? [02:49:48] sometimes it's not in the YYYY-MM-DDTHH:MM:SS format [02:50:00] but it has milliseconds as well? [02:50:12] that's a difference between varnish and squid servers [02:55:38] average_drifter ^^ [02:55:56] yes it has milliseconds [02:56:03] so varnish has YYYY-MM-DDTHH:MM:SS [02:56:10] and squid has UNIX epoch? [02:56:14] or vice versa? [02:56:25] no [02:56:28] one has YYYY-MM-DDTHH:MM:SS [02:56:31] yes [02:56:38] and the other has YYYY-MM-DDTHH:MM:SS:mmmm [02:56:43] or something like that [02:56:44] yes [02:56:54] uhm, well, I am discarding the mmmm [02:57:11] can you show me an example of a timestamp that you are discarding [02:57:11] but the problem is, sometimes I am getting things that look like UNIX timestamps in that column [02:57:15] yes [02:57:18] k [02:59:52] and where is the bot filter code in wikistats?
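Rather than discarding lines whose time-field deviates from `YYYY-MM-DDTHH:MM:SS`, the variants mentioned above could be normalized. A sketch, with the caveat that the exact fractional-seconds delimiter is a guess (the chat only says "or something like that"):

```python
from datetime import datetime, timezone

def normalize_ts(raw: str):
    """Return a datetime for the known time-field variants, or None.

    Handles plain ISO seconds, ISO with a fractional suffix (assumed
    dot-delimited here), and the occasional value that looks like a
    UNIX epoch; anything else is left for the caller to count as
    discarded.
    """
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%S.%f"):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            pass
    # Some lines carry what looks like a UNIX timestamp instead.
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        return None  # genuinely corrupted

print(normalize_ts("2013-01-12T02:46:46"))  # 2013-01-12 02:46:46
```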
[03:02:15] looking [03:03:14] ty [03:12:50] i think ez does something like this $agent =~ s/(bot|spider|crawl(?:er)?)/ for bot detection [03:16:41] it seems there's a BotsAll.csv [03:16:55] and it is created by reading the Wikipedia XML dump [03:17:05] I might be wrong, I need to look a bit more at the code [03:17:43] what I meant above is that the Wikipedia XML dump is read and revisions of articles show traces of bots in them [03:18:01] and that's one source where bots are extracted from [03:18:10] the IPs I presume [03:18:41] sorry :) [03:18:52] those are bots that *edit* wikipedia, not *visit* wikipedia [03:18:57] so you can ignore those [03:19:03] ok [03:19:09] we only have to look at user agent strings [03:19:16] alright, I'll do a bit more grepping [03:19:19] and i think that ez has a pretty simple heuristic [03:21:23] found it [03:22:52] ..... [03:23:14] wanted to paste.. my console is problematic SquidCountArchiveProcessLogRecord.pm +268 [03:23:19] line 268 [03:23:54] there are bots and googlebots [03:24:14] these are the two types of bots showing up in wikistats [03:24:26] can you just copy / paste that code? [03:26:34] you also need to add code to ignore WMF traffic [03:26:51] (basically you also need to mimic webstatscollector business logic) [03:29:44] https://gist.github.com/0107f1aea4674e30f5f9 [03:29:59] all of this was found in SquidCountArchiveProcessLogRecord.pm [03:30:29] drdee: ^^ [03:30:57] drdee: what can I do about the running time? it will increase as I add these checks [03:31:14] yes welcome to my world [03:31:20] :) [03:31:43] I think I'm going to take some weapons out of my toolbox [03:32:23] lines 65-75 can be simplified [03:32:32] we only want to know if it's a google bot or not [03:32:37] we don't care about the type of googlebot [03:33:04] ok [03:33:07] the ip range checks…..
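The heuristic quoted above from SquidCountArchiveProcessLogRecord.pm boils down to a regex test on the user agent. A rough Python equivalent of ez's pattern; case-insensitive matching is my assumption, and real bot lists are longer:

```python
import re

# Mirrors ez's $agent =~ /(bot|spider|crawl(?:er)?)/ heuristic.
BOT_RE = re.compile(r"(bot|spider|crawl(?:er)?)", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Crude user-agent check: any bot/spider/crawl(er) substring."""
    return BOT_RE.search(user_agent) is not None

def is_googlebot(user_agent: str) -> bool:
    # Per the chat: we only want to know *whether* it is a googlebot,
    # not which type of googlebot it is.
    return "googlebot" in user_agent.lower()

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/17.0"))     # False
```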
[03:33:12] yes [03:33:22] we can start with user agent strings and see how well that works [03:33:39] and then do a test to see how many more bots we would identify if we used ip range checks [03:35:32] ok [03:40:27] ": I think I'm going to take some weapons out of my toolbox" [03:40:41] what do you want to do? [03:42:03] replace slow parts with some XS [03:42:17] * average_drifter dodges [03:42:28] XS? [03:42:35] yes, JNI for Perl [03:42:55] i understand that you wanna do that but it's not worth it [03:43:00] ok [03:43:09] kraken is the solution to lack of speed [03:43:11] I was only thinking of doing it so I could roll out reports faster [03:43:14] ok [03:43:36] while debugging you can just rerun november/december [03:43:49] ok I'll focus on those months [06:28:04] test [06:28:11] ok it works I guess [14:09:51] morning! [14:09:59] hey drdee [14:13:32] morning [14:16:46] mooooornign milimetric! [14:16:52] morning louisdang [14:17:12] so it turns out OpenJDK has some known problems. Stability, performance, and graphics [14:17:16] :)) [14:17:23] pushed parent maven pom [14:17:27] I'm switching to closed source Oracle Java for now unfortunately [14:17:39] so all our future maven projects can inherit from that one [14:17:42] it's in kraken/maven [14:18:14] cool [14:20:18] i also figured out how to use our maven nexus repo [14:20:32] there is an example settings.xml in kraken/maven as well [14:23:56] moooorning! [14:24:06] mooooooorning ottomata!!!!!! [14:24:35] good morning [14:24:49] ottomata, would you have some time today to dive into that VUMI thing? [14:25:12] (at least i now know where everything is located) [14:27:26] sure, i think so [14:28:06] grabbing some coffee, let me know when [14:28:38] ok, i need to do some other things first [14:28:41] have to do an ops thing [14:28:45] aight [14:28:49] brb coffee [14:38:44] For all those working on Java. IntelliJ IDEA is released under the Apache license and is a LOT better than Eclipse.
Faster, way better debugger, friendlier, in general just kicks Eclipse's ass: http://www.jetbrains.com/idea/free_java_ide.html [14:39:09] drdee, ottomata, dschoon, dschoon_ ^^ [14:39:54] cool [14:55:02] ty [15:24:53] ok drdee, let's doooo it [15:25:01] okdioki [15:25:14] is it just a udp2log instance somewhere? [15:30:39] drdee? [15:30:49] vumi-metrics on labs [15:31:09] it is not a udp2log instance [15:31:24] it's vumi and it has support to send data to a udp2log instance [15:31:53] right, but we don't have to do anything with the actual vumi bit, right? [15:32:06] we are just setting up a collection point when they deploy to production? [15:37:07] yes, that's my understanding, but have a look at /etc/puppet/files/mobile/vumi/supervisord.wikipedia.conf on vumi-metrics [15:37:15] drdee? [15:37:19] : yes, that's my understanding, but have a look at /etc/puppet/files/mobile/vumi/supervisord.wikipedia.conf on vumi-metrics [15:37:40] yeah i've looked at that [15:37:45] but that has nothing to do with us, right? [15:37:51] that's their VUMI server setup [15:38:16] all we care about is whatever final value they set [15:38:16] --set-option=metrics_host:localhost [15:38:16] --set-option=metrics_port:5678 [15:38:17] to [15:38:25] i think we should run this on oxygen [15:38:31] so we need them to put oxygen's IP there [15:38:49] my q to them (and you) is [15:38:55] when are they deploying? [15:39:02] and am I correct in my assumption for what we need to do? [15:44:15] i don't know when they are deploying, that's why i want to have a talk with jeffrey [15:44:20] ha, ok [15:44:25] but you wanted me to work on this today, right? [15:44:31] should I set up the udp2log instance for this? [15:44:38] yes, just to make sure it works [15:44:48] and we know what type of data is sent [15:45:01] they are just sending udp packets to an address, it'll work [15:45:08] where are they deploying? eqiad? [15:45:11] but in what format? [15:45:15] does it matter?
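Confirming that udp2log receives data doesn't need the real VUMI deploy: a datagram can be faked at the configured metrics port (5678, per the --set-option values quoted above). The payload here is made up, since the format question is still open:

```python
import socket

def send_fake_metric(host: str, port: int, payload: bytes) -> int:
    """Fire one UDP datagram at a udp2log listener; returns bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(payload, (host, port))
    finally:
        sock.close()

# UDP is connectionless, so this succeeds whether or not anything
# is listening (which is also why the joke earlier works).
send_fake_metric("127.0.0.1", 5678, b"vumi.metric.test 1\n")
```

On the receiving side, a plain udp2log filter (or even `nc -lu 5678`) is enough to verify the bytes arrive and get stored to a file on oxygen.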
[15:45:32] mmmmm maybe not :) [15:45:38] you know [15:45:42] they could just hit the event.gif url :) [15:46:26] i am happy if we have confirmed that we receive actual data and have stored it in a file on oxygen [15:53:17] well, we won't receive any actual data until they deploy, right? [15:56:04] but i thought we could fake it [15:56:56] i can set up a udp2log instance on the vumi-metrics instance [15:56:57] no probs [15:57:22] ok sounds good [15:57:33] i pm'ed you with instructions on how to fake test data [16:45:40] ha, drdee, the stats user is also in ldap, and the analytics machines all use ldap [16:45:45] so stats user is already available! [16:45:48] :D [16:45:49] nice [16:46:04] just spent a while trying to figure out why puppet didn't want to add a manual account for the user! [16:46:11] the user already existed! [17:01:38] drdee, brain bounce with me about the log date content problem [17:01:45] maybe we need an action at the start of the workflow [17:01:49] love brainbouncing [17:01:57] that examines the logs from before and after the current dataset [17:02:10] and generates a new file that only includes the desired timespan [17:02:15] so this is about the problem of having data from n+1 in the log file of n? [17:02:19] and then the oozie job will work on that [17:02:27] or n-1 yes [17:02:42] isn't it easier to fix in the python script? [17:02:55] what python script?
[17:03:01] the kafka-hadoop-consume [17:03:03] you mean hadoop kafka importer [17:03:08] that is actually a java mapred program [17:03:12] python is just a wrapper [17:03:14] but [17:03:15] probably not [17:03:44] because it is consuming everything from kafka, it would have to examine all of the log lines to figure out what goes where...HMMMMMmmmmmmm [17:03:44] yes maybe you are right [17:03:47] it could do that, it's mapred [17:03:55] so example: [17:04:10] curr date is jan 8 [17:04:22] let's say consumer ran and consumed a buncha jan 7 data and jan 8 data [17:04:43] ideally, it would go to separate files right at that moment [17:04:46] it would know to store what where based on the frequency parameter [17:04:55] but then tomorrow [17:04:57] on jan 9 [17:05:08] it would consume the new jan 8 data, and whatever has been written for jan 9 [17:05:19] it would have to write new files into the same jan 8 directory it created the day before [17:05:38] which is tricky [17:05:56] yeah totally tricky [17:06:12] man, we should really start using storm for the imports, this kafka hadoop importer thing is a temporary solution [17:06:23] that is the real solution? [17:06:33] the original plan is to consume from kafka via storm [17:06:36] storm would do ETL stuff [17:06:38] and then write to hdfs [17:06:45] because if it is, then i would say let's not bother [17:06:59] let's not bother with the kafka-hadoop importer you mean? [17:06:59] right so we can just run storm without ETL [17:07:02] right [17:07:02] yes [17:07:07] i think that would be better in the long term, and probably easier [17:07:13] let's do that then [17:07:43] ok, so we need dschoon to finish setting up whatever java stuff he was working on, and tell us how to use it i guess [17:07:54] what java stuff? [17:07:59] storm already has an hdfs sink, i will see what I can do with it [17:08:06] i dunno, sonatype repo, dev env? [17:08:15] i fixed most of that last night [17:08:23] sonatype repo is working [17:08:25] oh?
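For the "data from day n±1 in day n's consume" problem discussed above, the examine-every-line step would amount to bucketing each record by the day in its timestamp field, so a run spanning midnight writes each day separately. A toy sketch; the field position and tab delimiter are assumptions:

```python
from collections import defaultdict

def bucket_by_day(lines, ts_field=0, delim="\t"):
    """Group log lines by the YYYY-MM-DD prefix of their timestamp,
    so jan 7 and jan 8 data land in separate per-day outputs even
    when consumed in a single run."""
    buckets = defaultdict(list)
    for line in lines:
        day = line.split(delim)[ts_field][:10]   # "2013-01-08T…" -> "2013-01-08"
        buckets[day].append(line)
    return dict(buckets)

lines = [
    "2013-01-07T23:59:58\t/wiki/Foo",
    "2013-01-08T00:00:01\t/wiki/Bar",
]
print(sorted(bucket_by_day(lines)))  # ['2013-01-07', '2013-01-08']
```

The hard part the chat identifies remains: on jan 9 the late jan 8 bucket must be appended into the jan 8 directory created the day before, which is why the conversation lands on storm doing this continuously instead.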
i mean, I don't know anything about it, or how to use it [17:08:37] check in kraken repo the maven folder [17:08:43] i think that was all working already [17:09:00] i was working on a tutorial [17:09:03] right, i think I just don't know what to do if I want to start deving with java [17:09:05] but y'all are smart boys [17:09:13] it contains example.settings.xml that you have to copy to your .m2/settings.xml file [17:09:14] exactly [17:09:20] and adjust it [17:09:23] i was going to set up proxying [17:09:29] and possibly archiva [17:09:34] but i dunno about that part [17:09:44] it seems like it handles binaries better than other repos [17:12:13] eclipse is really sucking balls [17:15:35] it's very, very good at that [17:17:48] i'm trying out this intellij thing milimetric suggested, no idea what I'm doing [17:18:02] do I need to tell it about ~/.m2 when I import kraken as a project? i unnooooo [17:18:14] it should always use that [17:18:31] it's mavenized right? [17:18:40] make sure you have installed maven :) [17:18:41] it has a pom.xml in other words [17:18:44] that will create ~/.m2 [17:18:45] nono [17:18:47] that's maven standard [17:18:58] you just have to point IDEA at the pom.xml [17:19:07] you go Import Project -> browse to pom.xml [17:19:09] yes, the kraken repo has a maven component [17:19:18] and it has a pom [17:19:25] ottomata ^ [17:19:25] so it knows to use all the maven defaults [17:19:31] two actually :) a parent pom (just pushed this morning) and a project pom for kraken.jar [17:19:55] yay [17:19:57] the parent pom uses the nexus sonatype [17:20:10] the kraken pom does not yet inherit from the parent pom [17:20:18] so go with the kraken pom.xml [17:20:40] ? [17:20:41] not hard. 
just a section [17:20:42] iirc [17:20:51] i need to sleep :) [17:20:52] i have one pom.xml [17:20:54] back later [17:20:55] maven/pom.xml [17:20:56] ok lataas [17:21:18] so right after you open IDEA [17:21:20] Import Project [17:21:29] and browse to that pom [17:21:40] does that work? [17:21:48] (I'm doing it too) [17:22:22] wait, there's a pom.xml right in the root of kraken [17:22:24] yeah, but hm, sort of [17:22:25] ohhhhh [17:22:29] that's probably the one I want [17:22:34] there is? [17:22:35] i don't have that [17:22:48] oh, i have old source [17:22:52] importing from maven/pom.xml worked, but i think it used maven/ as the root [17:22:54] Diederik decided to mess with us [17:22:57] and didn't get any of the existing code [17:23:03] that's the parent pom [17:23:08] drdee, where's the old pom? [17:23:12] and what's a parent pom? [17:23:30] just in src/ [17:23:49] other poms can inherit from the parent pom [17:23:58] but none of the projects actually use the parent pom [17:24:08] i just pushed it to github [17:24:21] so maven/pom.xml is the parent pom [17:24:33] src/pom.xml is the kraken project pom [17:24:40] and use that one [17:24:52] src/pom.xml does not exist [17:24:54] you in the master branch? [17:25:04] yes, src/pom.xml isn't in master [17:25:27] i think you're saying there should be a /pom.xml that points to the maven/pom.xml as the parent [17:25:29] correct? [17:25:50] ?????? [17:25:52] what?
[17:26:08] https://github.com/wmf-analytics/kraken/tree/master/src [17:26:20] no pom ^ [17:26:32] heh, my font makes that look like porn [17:26:36] i see a pom.xml in the main kraken/ dir in the standardize_timestamp branch [17:27:24] can you quickly save that one, i think i made a small mistake [17:28:08] got it [17:28:14] hold on [17:29:21] okay i put it back in the root folder of kraken [17:29:23] sorry [17:30:10] cool, pulling [17:31:12] ottomata, after importing with the pom at the root, I checked that "Import Maven Projects automatically" [17:31:35] sounded like that keeps the IDEA project in sync with the pom which seems like a good idea [17:31:59] if you hover over stuff, there's usually helpful tooltips unlike a certain open source project I won't mention :) [17:33:02] cool ok [17:35:59] hmmm so if I want to add storm and hadoop dependencies.... [17:36:03] how do I do that? [17:36:21] Oh i see [17:36:23] in the root pom.xml [17:36:25] there are dependencies [17:36:36] and I guess those are in the nexus repo already, right? [17:36:36] or something? [17:36:39] ah no, cloudera has one [17:36:40] ok [17:36:45] no not yet [17:36:47] so I should be able to add a storm repo or something? [17:37:02] i would leave maven aside for now and just work on the code [17:37:13] once it works, mavenization will be clear [17:37:26] kinda like this: [17:37:27] https://github.com/nathanmarz/storm-starter/blob/master/m2-pom.xml [17:37:30] oh? [17:37:39] but I will need to compile with the dependencies when I work on the code, won't I? [17:37:54] ottomata, I'm a bit behind on where you are with the limnification of the datasource you created [17:38:20] yes, but do you know what dependencies you need? [17:38:24] well, i left that off, because there's a problem [17:38:26] with the data [17:39:45] um, storm? [17:39:55] <dependency> [17:39:56] <groupId>storm</groupId> [17:39:56] <artifactId>storm</artifactId> [17:39:56] <version>0.8.1</version> [17:39:56] <scope>provided</scope> [17:39:56] </dependency> [17:39:57] ?
[17:40:44] i would remove scope [17:40:57] oh maybe not, keep it [17:41:03] yes just add it to pom.xml [17:41:59] groupId is probably slightly different [17:42:10] but that's a guess [17:42:47] https://github.com/nathanmarz/storm/wiki/Maven [17:42:53] ok so it does look good [17:42:59] you do need to add the repo: [17:43:00] https://github.com/nathanmarz/storm/wiki/Maven [17:43:06] <repository> [17:43:07] <id>clojars.org</id> [17:43:08] <url>http://clojars.org/repo</url> [17:43:08] </repository> [17:45:14] ottomata ^^ [17:45:17] hmm, ok [17:46:13] ok, i'm going to try intellij with the storm-starter project first, and see if I can get it to run a storm topology locally [17:46:19] ok [17:46:21] if I get that far I think I'll understand what needs to happen for kraken [17:46:24] let me know if i can help you [17:46:50] k danke [17:52:12] ok drdee, milimetric [17:52:15] no idea what i'm doing here [17:52:17] added clojars.org as proxy repo to our nexus repo [17:52:32] what are you trying to do? [17:52:32] i got storm-starter cloned and imported in intellij [17:52:42] it comes with a wordcount topology example [17:52:54] trying to compile it [17:53:05] I *think* I synced sources, or something [17:53:07] it has things like [17:53:08] import backtype.storm.Config; [17:53:18] but i'm getting package does not exist errors [17:53:22] shoot me the repo you imported, I'll try it before our standup [17:53:29] https://github.com/nathanmarz/storm-starter [17:53:45] i just have no idea what i'm doing here [17:53:53] i have no experience with java IDEs and dependencies [17:54:09] so I have this thing all imported [17:54:10] and i'm like [17:54:12] hm, that word count says something about python [17:54:14] ok, now what? :) [17:54:23] yeah storm does multilang [17:54:30] i think it uses a python bolt in the process [17:54:39] but, if that class was compiled [17:54:56] you could run it with java -Dexec.mainClass=storm.starter.WordCountTopology [17:54:58] did you install this leiningen thing? [17:55:04] naw, do I need to?
[17:55:13] i think that's for clojure stuff [17:55:28] Maven is an alternative to Leiningen. [17:55:30] oh sorry [17:55:30] yep [17:55:56] i was able to get this to work last spring when I was first playing with it [17:56:04] but I downloaded the deps manually and put them on my classpath [17:56:07] and compiled, etc. [17:57:53] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [18:22:43] erosen, milimetric: woohoo, all oiled and ready! [18:22:43] http://www.flickr.com/photos/ottomatona/8360847841/in/photostream/ [18:24:18] looks nice! [18:59:49] dschoon, ping me when you get back, I'm in the process of deploying and I'm finishing the feature/d3 branch [19:00:03] but i know you're sick so don't worry unless you're gonna work on it [19:06:15] hye milimetric [19:06:20] what do you think is better for an unknown continent? [19:06:21] - [19:06:22] or [19:06:23] hi [19:06:23] unknown [19:06:25] or something else? [19:06:46] I like Unknown [19:07:10] capital? ok [19:07:16] yeah, so it matches the others [19:07:19] people might want to report on that [19:07:24] ok cool [19:07:30] it'd be nice to see that number go down :) [19:09:21] yeah need to figure out why there are so many [19:13:09] hmmmm [19:13:14] milimetric, the reason why there are unknowns [19:13:21] is beacuse some IPs geocode to continents with no country [19:13:31] $ geoiplookup 195.212.29.166 [19:13:31] GeoIP Country Edition: EU, Europe [19:13:31] GeoIP City Edition, Rev 1: EU, N/A, N/A, N/A, 47.000000, 8.000000, 0, 0 [19:13:53] http://www.maxmind.com/en/geoip_demo [19:17:20] yeah, so this number is definitely useful and hopefully as GeoIP improves we'll see the number go down [19:17:37] until then people could use it to calculate error on the other numbers [19:18:03] well, some of these I can probably get the continent out of the country name [19:18:08] that one there is definitely europe [19:18:13] the UDF just doesn't know what country "EU" is in [19:18:15] sorry [19:18:19] what 
continent "EU" is in [19:18:26] since it is expecting a country code [19:38:39] ungh! [19:38:45] drdee, what do you think I should do [19:38:54] if sometimes the countryCode is actually the continentCode [19:39:26] i would add a check for that, the number of continents is very limited [19:39:32] it's more difficult [19:39:32] um [19:39:36] ohhh [19:39:41] AF is a countryCode and a continentCode [19:39:49] africa, afghanistan [19:39:50] crap [19:40:09] do you know all the exceptions? [19:40:09] i looked in just one log file [19:40:11] use the alpha-3 codes [19:40:18] i only found two [19:40:21] for countries [19:40:31] EU, Europe [19:40:31] A2, Satellite Provider [19:40:41] hmmmm, can I get that from the maxmind db? [19:40:45] countries? [19:40:47] yes [19:40:49] 3 letter [19:41:28] https://gerrit.wikimedia.org/r/gitweb?p=analytics/reportcard/data.git;a=blob;f=geo/country-codes.json;h=5f926143aad32242f593fd3e9b1f295f845a3392;hb=refs/heads/develop [19:41:51] mapping of alpha-2, alpha-3, and a bunch of other codes to names and such [19:42:02] IOC is the olympic codes [19:42:05] FIFA codes [19:42:28] a2 is what maxmind returns by default, iirc [19:42:58] i coded that file in a bunch of forms [19:43:02] https://gerrit.wikimedia.org/r/gitweb?p=analytics/reportcard/data.git;a=tree;f=geo;h=2996adc4076e4c821f7ef0f21e3b3e99f0b5bb84;hb=refs/heads/develop [19:43:47] siigh, why are things always harder than they should be! [19:45:36] can't we just hardcode the exceptions? [19:46:53] i think the answer is that there ARE no country codes [19:46:58] er [19:46:59] heh [19:47:03] no CONTINENT codes [19:47:16] (i'm sick.
give me a break) [19:47:30] drdee, i'll check for exceptions in a day's worth of all.100 data [19:47:36] if I only see those two, i'll just do that [19:47:58] sounds good to me [20:11:10] ergh, drdee, there are more [20:11:25] geoiplookup 206.53.148.209 [20:11:25] GeoIP Country Edition: AP, Asia/Pacific Region [20:11:42] lots of 'anonymous proxy' [20:11:51] welp [20:11:54] i think though [20:11:56] the whole list is http://www.maxmind.com/en/iso3166 [20:12:04] there are things that the db returns that are not on that list [20:12:14] really? [20:12:16] yes [20:12:25] i mean, there are things in that list that are not in 3166 [20:12:29] $ geoiplookup 65.49.68.181 [20:12:29] GeoIP Country Edition: A1, Anonymous Proxy [20:12:39] that's... #1 [20:12:40] on the list [20:12:41] A1 is not a country [20:12:43] that i just linked [20:12:45] oh maxmind [20:12:50] sorry i was looking at the wikipedia page [20:12:54] hmmmmmmmm ok [20:12:54] the list there has a bunch of stuff that's not in 3166 [20:12:56] I see [20:12:57] as i said [20:13:01] O1 [20:13:04] "Other country" [20:13:05] aye aye i get it [20:13:06] Helpful! [20:13:24] AP,"Asia/Pacific Region" [20:20:24] ok cool, thanks dschoon, that was very helpful [20:20:26] there are 5 exceptions [20:20:28] I can hardcode them [20:20:31] awesome [20:21:31] but! [20:21:32] question [20:21:40] is asia/pacific in asia? or oceania? [20:24:08] what is the ip address that returns asia/pacific?
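The five pseudo-country codes turned up above can be hardcoded. The continent assignments here are judgment calls (MaxMind defines the codes, not their continents), and mapping AP to Asia rather than Oceania is exactly the open question:

```python
# MaxMind codes that are not real ISO 3166 countries, mapped by hand.
CONTINENT_EXCEPTIONS = {
    "EU": "Europe",     # geoiplookup: "EU, Europe" (no country)
    "AP": "Asia",       # "Asia/Pacific Region" -- could argue Oceania
    "A1": "Unknown",    # Anonymous Proxy
    "A2": "Unknown",    # Satellite Provider
    "O1": "Unknown",    # Other country
}

def continent_for(code: str, country_to_continent: dict) -> str:
    """Resolve a MaxMind country code to a continent. Real country
    codes win first, so e.g. 'AF' stays Afghanistan/Asia: MaxMind
    returns country codes, and only the entries above are
    continent-like leftovers."""
    return country_to_continent.get(code) or CONTINENT_EXCEPTIONS.get(code, "Unknown")

print(continent_for("EU", {"DE": "Europe", "AF": "Asia"}))  # Europe
print(continent_for("A1", {}))                              # Unknown
```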
[20:24:38] here's one [20:24:39] 206.53.152.167 [20:25:29] that is in china, according to google maps [20:25:48] actually [20:25:49] can't really tell [20:25:58] that's just according to the coordinates that maxmind gives back for those IPs [20:26:00] those are in china [20:26:03] so I guess I'll pick asia [20:26:09] china, http://whatismyipaddress.com/ip/206.53.152.167 [20:49:27] I heard on a different IRC channel that in China, if you have a Linux laptop, interwebz would disconnect every 5 minutes [20:49:43] and they're forced to use Windows [20:50:21] I guess that would be a good criterion to check for the small percentage of Linux in China [20:50:37] or vetting as ez calls it [20:54:54] might not be true.. [20:55:13] hey drdee, [20:55:22] where do user-generated warnings in pig go? [20:55:23] warn("getLocation() returned null on input: " + ip, PigWarning.UDF_WARNING_1); [20:55:28] yooooo [20:56:06] one of the many hadoop log files i assume :D [20:56:14] ok looking [21:30:47] drdee, since pulling kraken repo on mvn package: [21:30:47] [ERROR] /home/otto/kraken/src/main/java/org/wikimedia/analytics/kraken/pig/isValidIPv4Address.java:[25,7] class IsValidIPv4Address is public, should be declared in a file named IsValidIPv4Address.java [21:31:02] [ERROR] /home/otto/kraken/src/main/java/org/wikimedia/analytics/kraken/pig/isValidIPv6Address.java:[26,7] class IsValidIPv6Address is public, should be declared in a file named IsValidIPv6Address.java [21:31:20] i thought i had changed those filenames [21:32:03] nope i didn't, 1 sec [21:33:38] very weird [21:33:53] i renamed the files but git does not see it [21:33:56] so i can't commit [21:33:57] oh boy [21:34:01] that's because you are on your mac [21:34:01] or push [21:34:06] and mac is case insensitive [21:34:14] I can probably do it on an01 [21:34:25] but our local repos are going to be unhappy when we pull [21:34:27] will have to reclone [21:34:46] or maybe you already did this, i dunno [21:34:48] and a reclone will help?
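The case-only rename that git couldn't see has a standard workaround: rename through an intermediate name that differs by more than case. A sketch using the filenames from the mvn error above:

```python
import os
import tempfile

def case_safe_rename(src: str, dst: str) -> None:
    """Rename src to dst via a temporary name, so the change is
    visible even on a case-insensitive filesystem (the OS X default),
    where a direct case-only rename can look like a no-op to git."""
    tmp = dst + ".rename-tmp"   # C: differs by more than case
    os.rename(src, tmp)         # A -> C
    os.rename(tmp, dst)         # C -> B

d = tempfile.mkdtemp()
open(os.path.join(d, "isValidIPv4Address.java"), "w").close()
case_safe_rename(os.path.join(d, "isValidIPv4Address.java"),
                 os.path.join(d, "IsValidIPv4Address.java"))
print(os.listdir(d))  # ['IsValidIPv4Address.java']
```

With git the same trick is two moves: `git mv isValidIPv4Address.java tmp && git mv tmp IsValidIPv4Address.java`.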
[21:35:00] mac os not case insensitive with filenames, is it? [21:35:08] yup [21:35:20] it's really bad [21:35:31] the default FS is installed insensitive [21:35:33] you can install sensitive [21:35:35] I did that once [21:35:39] but it messed a lot of things up [21:36:45] hold on [21:37:14] ah [21:37:16] git pull worked anyway [21:37:18] i got it to push [21:37:19] pull now [21:39:15] arrghh [21:39:17] i was fixing it [21:39:44] now it is really broken [21:39:51] the solution was to do a temp rename to another file [21:39:59] instead of A -> B [21:40:05] do A -> C -> B [21:43:20] ok pulled [21:43:23] thanks! [21:43:34] was not aware of the case stuff on osx at all [21:50:38] drdee: about case (i)sensitivity, when you install OSX you get to choose whether you want case insensitive/sensitive filenames [21:50:55] that was the case for me when I installed iATKOS (fork of OSX) on my vm [21:50:58] did not know that at all [21:51:02] thanks [21:54:36] yeah but don't change it! [21:54:45] it will bite you later when you are least expecting it [21:54:53] you'll try to install or run some app, and it will be all weirded out [21:55:10] OS X coders don't bother with being consistent with their path names, I guess [21:55:15] often using Upper case, other times not [22:07:30] heyaa erosen [22:07:34] ok! trying to use limnify [22:08:58] great [22:09:00] i'm in a meeting [22:09:03] so only have half attention [22:09:10] limnify: error: argument --datecol: invalid int value: 'Hour' [22:09:16] limnify --delim='\t' --datefmt="%Y-%m-%d_%H" --datecol Hour ~/pig/krakensrc/c/hour_continent_mobile.tsv [22:09:25] otto@analytics1001:~/scr$ head ~/pig/krakensrc/c/hour_continent_mobile.tsv [22:09:26] Hour Continent Count [22:09:26] 2013-01-01_00 Asia 535984 [22:09:26] 2013-01-01_00 Africa 20536 [22:09:58] did you pass in the columns [22:11:30] limnify: error: argument --datecol: invalid int value: 'Hour' [22:11:34] what am I supposed to put?
[22:12:40] oh [22:12:41] sorry [22:12:45] -h gives more info [22:12:48] i was just reading the error usage [22:14:57] hmm, the default columns should work [22:15:35] hmm [22:15:42] sorry i can't quite multitask on this now [22:15:47] s'ok, thanks [22:15:48] can I get back in 45? [22:15:55] i'm out in 15 [22:15:56] hmm [22:15:59] but you can try it on an01 if you want [22:16:01] file is [22:16:08] /home/otto/pig/krakensrc/c/hour_continent_mobile.tsv [22:17:04] great [22:17:06] i was going to ask that [22:25:48] alright, thanks for the help, i'm outtaaaa. man that took way too much time today, but continents are better now [22:25:49] laters boys! [22:32:48] laterz!!!! [22:38:25] drdee: will show you some neat stuff when you come back :) [22:38:37] oh you were just replying to Andrew :) [22:38:42] ok, almost ready [23:08:04] average_drifter: show me! [23:11:55] drdee: not ready yet, but I hope very soon [23:12:27] aight [23:51:40] omg that took so much effort and time. I've been through hell and back for the last 10 hours [23:51:40] http://test-reportcard.wmflabs.org/ [23:51:45] it's up. minified [23:51:56] I'm defeated, going to go curl up in a corner and die [23:53:05] or cry. Yeah, that's less dramatic :)