[00:15:08] is /a on stat1 shared among many machines ? [00:15:09] dschoon, milimetric: quick limn q [00:15:09] spetrea@stat1:/a/wikistats_git$ mount | grep "/a" [00:15:09] /dev/mapper/stat1-a on /a type ext4 (rw) [00:15:34] how do i make it so that the graph doesn't display certain data points at all? [00:18:02] for the record [00:18:05] i feel like death [00:18:11] erosen: you leave them undefined [00:18:21] great [00:18:26] 2013/01/12,10,,3 [00:18:31] thanks [00:18:33] sorry to hear that [00:18:37] that third entry is undefined [00:18:41] I was surprised to see you hanging on [00:18:49] trust me [00:18:54] i am as displeased as anyone [00:18:58] hehe [00:28:20] hey erosen [00:28:24] hey [00:28:28] was helping drdee for a while [00:28:33] np [00:28:38] so this graph has missing data: [00:28:55] i actually checked this one: http://dev-reportcard.wmflabs.org/ [00:29:03] and it looks empty indeed [00:29:06] http://dev-reportcard.wmflabs.org/graphs/active_editors_target [00:29:16] http://dev-reportcard.wmflabs.org/data/datafiles/rc/rc_active_editors_target.csv [00:30:57] word [00:30:58] thanks [00:31:04] does that help? [00:31:18] yup [00:31:29] i just put consecutive commas: ,, [00:31:38] it's like metric-list;metric-list;metric-list [00:31:45] where metric-list is value,value,value [00:31:49] and you can have trailing , [00:31:51] or trailing ; [00:31:55] and it doesn't seem to care much :) [00:32:23] oh interesting... i wonder if that's not displaying right though [00:32:49] oh sorry this is such weird format [00:33:00] np [00:33:20] i'm actually a bit confused by the semicolons [00:33:35] is that your way of symbolically denoting columns? 
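A minimal sketch of the datafile row convention described above: an empty field between commas is an undefined point (the graph skips it), semicolons separate metric-lists, and trailing `,` or `;` are tolerated. The parser itself is hypothetical, not Limn's actual code:

```python
def parse_row(row: str):
    """Parse one datafile row like '2013/01/12,10,,3'.

    An empty value becomes None, which is how a graph point is left
    undefined; trailing ',' and ';' are ignored, as the chat suggests
    Limn doesn't care much about them.
    """
    date, _, rest = row.partition(",")
    metric_lists = []
    for group in rest.rstrip(";").split(";"):
        metric_lists.append([float(v) if v else None
                             for v in group.rstrip(",").split(",")])
    return date, metric_lists

print(parse_row("2013/01/12,10,,3"))
# ('2013/01/12', [[10.0, None, 3.0]])
```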
[00:36:09] i think this changed [00:36:15] and somehow is magically still handled by new limn [00:36:45] just use this instead (one sec) [00:37:47] oh damn gerrit, i'm gonna email you this file [00:37:58] hehe [00:40:17] ok this file format is quite crazy [00:40:36] i sent you the file we have in the reportcard-data repository which I think is a bit more sane [00:40:56] I'll keep in mind to revisit this [00:47:24] milimetric: thanks I got the file [01:33:33] damn I hope I have no syntax errors [01:33:48] erm, or other stuff [01:33:58] reporting reached 18-December-2012 [01:39:02] so it finished? [01:39:08] maybe not all files are on stat1? [01:46:36] all the files are on stat1 [01:47:00] didn't finish yet, it's at 21-December. It has to reach 1st January 2013 to finish [01:51:30] ok [01:53:53] drums please! [01:55:46] :D [01:59:53] I also have a UDP joke [02:02:10] shoot [02:02:44] I wanted to tell a UDP joke but I was afraid you wouldn't get it [02:03:02] ROFLOL [02:03:05] :D :D :D [02:03:08] :) [02:36:53] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r10/pageviews.html [02:37:32] after 25h12m [02:38:04] nice! [02:38:08] so this does not have: [02:38:13] * bot filtering [02:38:27] * deduplication of API requests [02:39:00] we are talking with tomasz on wednesday; if we can have bot filtering ready by then, that would be really awesome [02:39:22] I think deduplication will be hard [02:39:31] yes it will be :) [02:39:37] about the discarded line [02:39:44] yes, those numbers are huge [02:39:46] that's not in millions, right? [02:40:27] so the sigma column contains actually processed lines, and the discarded column has discarded lines (because of the time or url column being incorrect or corrupted) [02:41:42] it looks like it's in millions [02:44:20] well actually the discarded line should not be multiplied by 1000 [02:45:33] oh, why?
[02:46:46] because ideally we would divide the number of discarded lines by the total number of lines in a day and display a percentage [02:47:07] absolute numbers don't say as much [02:48:55] there are some problems with the discarded lines. sometimes a line is discarded because the time-field was invalid (there are such cases) [02:49:14] uhm, but yeah I can associate it to a month [02:49:19] what's wrong with the time-field? [02:49:35] i would not discard those lines [02:49:39] so where is the bot filtering in wikistats right now? [02:49:48] sometimes it's not in the YYYY-MM-DDTHH:MM:SS format [02:50:00] but it has milliseconds as well? [02:50:12] that's a difference between varnish and squid servers [02:55:38] average_drifter ^^ [02:55:56] yes it has milliseconds [02:56:03] so varnish has YYYY-MM-DDTHH:MM:SS [02:56:10] and squid has UNIX epoch? [02:56:14] or vice versa? [02:56:25] no [02:56:28] one has YYYY-MM-DDTHH:MM:SS [02:56:31] yes [02:56:38] and the other has YYYY-MM-DDTHH:MM:SS:mmmm [02:56:43] or something like that [02:56:44] yes [02:56:54] uhm, well, I am discarding the mmmm [02:57:11] can you show me an example of a timestamp that you are discarding [02:57:11] but the problem is, sometimes I am getting things that look like UNIX timestamps in that column [02:57:15] yes [02:57:18] k [02:59:52] and where is the bot filter code in wikistats?
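Rather than discarding lines whose time-field deviates from `YYYY-MM-DDTHH:MM:SS`, the variants mentioned above could be normalized. A sketch, with the caveat that the exact fractional-seconds delimiter is a guess (the chat only says "or something like that"):

```python
from datetime import datetime, timezone

def normalize_ts(raw: str):
    """Return a datetime for the known time-field variants, or None.

    Handles plain ISO seconds, ISO with a fractional suffix (assumed
    dot-delimited here), and the occasional value that looks like a
    UNIX epoch; anything else is left for the caller to count as
    discarded.
    """
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%S.%f"):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            pass
    # Some lines carry what looks like a UNIX timestamp instead.
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        return None  # genuinely corrupted

print(normalize_ts("2013-01-12T02:46:46"))  # 2013-01-12 02:46:46
```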
[03:02:15] looking [03:03:14] ty [03:12:50] i think ez does something like this $agent =~ s/(bot|spider|crawl(?:er)?)/ for bot detection [03:16:41] it seems there's a BotsAll.csv [03:16:55] and it is created by reading the Wikipedia XML dump [03:17:05] I might be wrong, I need to look a bit more at the code [03:17:43] what I meant above is that the Wikipedia XML dump is read and revisions of articles show traces of bots in them [03:18:01] and that's one source where bots are extracted from [03:18:10] the IPs I presume [03:18:41] sorry :) [03:18:52] those are bots that *edit* wikipedia, not *visit* wikipedia [03:18:57] so you can ignore those [03:19:03] ok [03:19:09] we only have to look at user agent strings [03:19:16] alright, I'll do a bit more grepping [03:19:19] and i think that ez has a pretty simple heuristic [03:21:23] found it [03:22:52] ..... [03:23:14] wanted to paste.. my console is problematic SquidCountArchiveProcessLogRecord.pm +268 [03:23:19] line 268 [03:23:54] there are bots and googlebots [03:24:14] these are the two types of bots showing up in wikistats [03:24:26] can you just copy / paste that code? [03:26:34] you also need to add code to ignore WMF traffic [03:26:51] (basically you also need to mimic webstatscollector business logic) [03:29:44] https://gist.github.com/0107f1aea4674e30f5f9 [03:29:59] all of this was found in SquidCountArchiveProcessLogRecord.pm [03:30:29] drdee: ^^ [03:30:57] drdee: what can I do about the running time? it will increase as I add these checks [03:31:14] yes welcome to my world [03:31:20] :) [03:31:43] I think I'm going to take some weapons out of my toolbox [03:32:23] lines 65-75 can be simplified [03:32:32] we only want to know if it's a google bot or not [03:32:37] we don't care about the type of googlebot [03:33:04] ok [03:33:07] the ip range checks…..
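The heuristic quoted above from SquidCountArchiveProcessLogRecord.pm boils down to a regex test on the user agent. A rough Python equivalent of ez's pattern; case-insensitive matching is my assumption, and real bot lists are longer:

```python
import re

# Mirrors ez's $agent =~ /(bot|spider|crawl(?:er)?)/ heuristic.
BOT_RE = re.compile(r"(bot|spider|crawl(?:er)?)", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Crude user-agent check: any bot/spider/crawl(er) substring."""
    return BOT_RE.search(user_agent) is not None

def is_googlebot(user_agent: str) -> bool:
    # Per the chat: we only want to know *whether* it is a googlebot,
    # not which type of googlebot it is.
    return "googlebot" in user_agent.lower()

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/17.0"))     # False
```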
[03:33:12] yes [03:33:22] we can start with user agent strings and see how well that works [03:33:39] and then do a test to see how many more bots we would identify if we used ip range checks [03:35:32] ok [03:40:27] ": I think I'm going to take some weapons out of my toolbox" [03:40:41] what do you want to do? [03:42:03] replace slow parts with some XS [03:42:17] * average_drifter dodges [03:42:28] XS? [03:42:35] yes, JNI for Perl [03:42:55] i understand that you wanna do that but it's not worth it [03:43:00] ok [03:43:09] kraken is the solution to lack of speed [03:43:11] I was only thinking of doing it so I could roll out reports faster [03:43:14] ok [03:43:36] while debugging you can just rerun november/december [03:43:49] ok I'll focus on those months [06:28:04] test [06:28:11] ok it works I guess [14:09:51] morning! [14:09:59] hey drdee [14:13:32] morning [14:16:46] mooooornign milimetric! [14:16:52] morning louisdang [14:17:12] so it turns out OpenJDK has some known problems. Stability, performance, and graphics [14:17:16] :)) [14:17:23] pushed parent maven pom [14:17:27] I'm switching to closed source Oracle Java for now unfortunately [14:17:39] so all our future maven projects can inherit from that one [14:17:42] it's in kraken/maven [14:18:14] cool [14:20:18] i also figured out how to use our maven nexus repo [14:20:32] there is an example settings.xml in kraken/maven as well [14:23:56] moooorning! [14:24:06] mooooooorning ottomata!!!!!! [14:24:35] good morning [14:24:49] ottomata, would you have some time today to dive into that VUMI thing? [14:25:12] (at least i now know where everything is located) [14:27:26] sure, i think so [14:28:06] grabbing some coffee, let me know when [14:28:38] ok, i need to do some other things first [14:28:41] have to do an ops thing [14:28:45] aight [14:28:49] brb coffee [14:38:44] For all those working on Java. IntelliJ IDEA is released under the Apache license and is a LOT better than Eclipse.
Faster, way better debugger, friendlier, in general just kicks Eclipse's ass: http://www.jetbrains.com/idea/free_java_ide.html [14:39:09] drdee, ottomata, dschoon, dschoon_ ^^ [14:39:54] cool [14:55:02] ty [15:24:53] ok drdee, let's doooo it [15:25:01] okdioki [15:25:14] is it just a udp2log instance somewhere? [15:30:39] drdee? [15:30:49] vumi-metrics on labs [15:31:09] it is not a udp2log instance [15:31:24] it's vumi and it has support to send data to a udp2log instance [15:31:53] right, but we don't have to do anything with the actual vumi bit, right? [15:32:06] we are just setting up a collection point when they deploy to production? [15:37:07] yes, that's my understanding, but have a look at /etc/puppet/files/mobile/vumi/supervisord.wikipedia.conf on vumi-metrics [15:37:15] drdee? [15:37:19] : yes, that's my understanding, but have a look at /etc/puppet/files/mobile/vumi/supervisord.wikipedia.conf on vumi-metrics [15:37:40] yeah i've looked at that [15:37:45] but that has nothing to do with us, right? [15:37:51] that's their VUMI server setup [15:38:16] all we care about is whatever final value they set [15:38:16] --set-option=metrics_host:localhost [15:38:16] --set-option=metrics_port:5678 [15:38:17] to [15:38:25] i think we should run this on oxygen [15:38:31] so we need them to put oxygen's IP there [15:38:49] my q to them (and you) is [15:38:55] when are they deploying? [15:39:02] and am I correct in my assumption for what we need to do? [15:44:15] i don't know when they are deploying, that's why i want to have a talk with jeffrey [15:44:20] ha, ok [15:44:25] but you wanted me to work on this today, right? [15:44:31] should I set up the udp2log instance for this? [15:44:38] yes, just to make sure it works [15:44:48] and we know what type of data is sent [15:45:01] they are just sending udp packets to an address, it'll work [15:45:08] where are they deploying? eqiad? [15:45:11] but in what format? [15:45:15] does it matter?
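Confirming that udp2log receives data doesn't need the real VUMI deploy: a datagram can be faked at the configured metrics port (5678, per the --set-option values quoted above). The payload here is made up, since the format question is still open:

```python
import socket

def send_fake_metric(host: str, port: int, payload: bytes) -> int:
    """Fire one UDP datagram at a udp2log listener; returns bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(payload, (host, port))
    finally:
        sock.close()

# UDP is connectionless, so this succeeds whether or not anything
# is listening (which is also why the joke earlier works).
send_fake_metric("127.0.0.1", 5678, b"vumi.metric.test 1\n")
```

On the receiving side, a plain udp2log filter (or even `nc -lu 5678`) is enough to verify the bytes arrive and get stored to a file on oxygen.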
[15:45:32] mmmmm maybe not :) [15:45:38] you know [15:45:42] they could just hit the event.gif url :) [15:46:26] i am happy if we have confirmed that we receive actual data and have stored it in a file on oxygen [15:53:17] well, we won't receive any actual data until they deploy, right? [15:56:04] but i thought we could fake it [15:56:56] i can set up a udp2log instance on the vumi-metrics instance [15:56:57] no probs [15:57:22] ok sounds good [15:57:33] i pm'ed you with instructions on how to fake test data [16:45:40] ha, drdee, the stats user is also in ldap, and the analytics machines all use ldap [16:45:45] so stats user is already available! [16:45:48] :D [16:45:49] nice [16:46:04] just spent a while trying to figure out why puppet didn't want to add a manual account for the user! [16:46:11] the user already existed! [17:01:38] drdee, brain bounce with me about the log date content problem [17:01:45] maybe we need an action at the start of the workflow [17:01:49] love brainbouncing [17:01:57] that examines the logs from before and after the current dataset [17:02:10] and generates a new file that only includes the desired timespan [17:02:15] so this is about the problem of having data from n+1 in the log file of n? [17:02:19] and then the oozie job will work on that [17:02:27] or n-1 yes [17:02:42] isn't it easier to fix in the python script? [17:02:55] what python script?
[17:03:01] the kafka-hadoop-consume [17:03:03] you mean hadoop kafka importer [17:03:08] that is actually a java mapred program [17:03:12] python is just a wrapper [17:03:14] but [17:03:15] probably not [17:03:44] because it is consuming everything from kafka, it would have to examine all of the log lines to figure out what goes where...HMMMMMmmmmmmm [17:03:44] yes maybe you are right [17:03:47] it could do that, it's mapred [17:03:55] so example: [17:04:10] curr date is jan 8 [17:04:22] let's say consumer ran and consumed a buncha jan 7 data and jan 8 data [17:04:43] ideally, it would go to separate files right at that moment [17:04:46] it would know to store what where based on the frequency parameter [17:04:55] but then tomorrow [17:04:57] on jan 9 [17:05:08] it would consume the new jan 8 data, and whatever has been written for jan 9 [17:05:19] it would have to write new files into the same jan 8 directory it created the day before [17:05:38] which is tricky [17:05:56] yeah totally tricky [17:06:12] man, we should really start using storm for the imports, this kafka hadoop importer thing is a temporary solution [17:06:23] that is the real solution? [17:06:33] the original plan is to consume from kafka via storm [17:06:36] storm would do ETL stuff [17:06:38] and then write to hdfs [17:06:45] because if it is, then i would say let's not bother [17:06:59] let's not bother with the kafka-hadoop importer you mean? [17:06:59] right so we can just run storm without ETL [17:07:02] right [17:07:02] yes [17:07:07] i think that would be better in the long term, and probably easier [17:07:13] let's do that then [17:07:43] ok, so we need dschoon to finish setting up whatever java stuff he was working on, and tell us how to use it i guess [17:07:54] what java stuff? [17:07:59] storm already has an hdfs sink, i will see what I can do with it [17:08:06] i dunno, sonatype repo, dev env? [17:08:15] i fixed most of that last night [17:08:23] sonatype repo is working [17:08:25] oh?
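For the "data from day n±1 in day n's consume" problem discussed above, the examine-every-line step would amount to bucketing each record by the day in its timestamp field, so a run spanning midnight writes each day separately. A toy sketch; the field position and tab delimiter are assumptions:

```python
from collections import defaultdict

def bucket_by_day(lines, ts_field=0, delim="\t"):
    """Group log lines by the YYYY-MM-DD prefix of their timestamp,
    so jan 7 and jan 8 data land in separate per-day outputs even
    when consumed in a single run."""
    buckets = defaultdict(list)
    for line in lines:
        day = line.split(delim)[ts_field][:10]   # "2013-01-08T…" -> "2013-01-08"
        buckets[day].append(line)
    return dict(buckets)

lines = [
    "2013-01-07T23:59:58\t/wiki/Foo",
    "2013-01-08T00:00:01\t/wiki/Bar",
]
print(sorted(bucket_by_day(lines)))  # ['2013-01-07', '2013-01-08']
```

The hard part the chat identifies remains: on jan 9 the late jan 8 bucket must be appended into the jan 8 directory created the day before, which is why the conversation lands on storm doing this continuously instead.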
i mean, I don't know anything about it, or how to use it [17:08:37] check in kraken repo the maven folder [17:08:43] i think that was all working already [17:09:00] i was working on a tutorial [17:09:03] right, i think I just don't know what to do if I want to start deving with java [17:09:05] but y'all are smart boys [17:09:13] it contains example.settings.xml that you have to copy to your .m2/settings.xml file [17:09:14] exactly [17:09:20] and adjust it [17:09:23] i was going to set up proxying [17:09:29] and possibly archiva [17:09:34] but i dunno about that part [17:09:44] it seems like it handles binaries better than other repos [17:12:13] eclipse is really sucking balls [17:15:35] it's very, very good at that [17:17:48] i'm trying out this intellij thing milimetric suggested, no idea what I'm doing [17:18:02] do I need to tell it about ~/.m2 when I import kraken as a project? i unnooooo [17:18:14] it should always use that [17:18:31] it's mavenized right? [17:18:40] make sure you have installed maven :) [17:18:41] it has a pom.xml in other words [17:18:44] that will create ~/.m2 [17:18:45] nono [17:18:47] that's maven standard [17:18:58] you just have to point IDEA at the pom.xml [17:19:07] you go Import Project -> browse to pom.xml [17:19:09] yes, the kraken repo has a maven component [17:19:18] and it has a pom [17:19:25] ottomata ^ [17:19:25] so it knows to use all the maven defaults [17:19:31] two actually :) a parent pom (just pushed this morning) and a project pom for kraken.jar [17:19:55] yay [17:19:57] the parent pom uses the nexus sonatype [17:20:10] the kraken pom does not yet inherit from the parent pom [17:20:18] so go with the kraken pom.xml [17:20:40] ? [17:20:41] not hard. 
just a section [17:20:42] iirc [17:20:51] i need to sleep :) [17:20:52] i have one pom.xml [17:20:54] back later [17:20:55] maven/pom.xml [17:20:56] ok lataas [17:21:18] so right after you open IDEA [17:21:20] Import Project [17:21:29] and browse to that pom [17:21:40] does that work? [17:21:48] (I'm doing it too) [17:22:22] wait, there's a pom.xml right in the root of kraken [17:22:24] yeah, but hm, sort of [17:22:25] ohhhhh [17:22:29] that's probably the one I want [17:22:34] there is? [17:22:35] i don't have that [17:22:48] oh, i have old source [17:22:52] importing from maven/pom.xml worked, but i think it used maven/ as the root [17:22:54] Diederik decided to mess with us [17:22:57] and didn't get any of the existing code [17:23:03] that's the parent pom [17:23:08] drdee, where's the old pom? [17:23:12] and what's a parent pom? [17:23:30] just in src/ [17:23:49] other poms can inherit from the parent pom [17:23:58] but none of the projects actually use the parent pom [17:24:08] i just pushed it to github [17:24:21] so maven/pom.xml is the parent pom [17:24:33] src/pom.xml is the kraken project pom [17:24:40] and use that one [17:24:52] src/pom.xml does not exist [17:24:54] you in the master branch? [17:25:04] yes, src/pom.xml isn't in master [17:25:27] i think you're saying there should be a /pom.xml that points to the maven/pom.xml as the parent [17:25:29] correct? [17:25:50] ?????? [17:25:52] what?
[17:26:08] https://github.com/wmf-analytics/kraken/tree/master/src [17:26:20] no pom ^ [17:26:32] heh, my font makes that look like porn [17:26:36] i see a pom.xml in the main kraken/ dir in the standardize_timestamp branch [17:27:24] can you quickly save that one, i think i made a small mistake [17:28:08] got it [17:28:14] hold on [17:29:21] okay i put it back in the root folder of kraken [17:29:23] sorry [17:30:10] cool, pulling [17:31:12] ottomata, after importing with the pom at the root, I checked that "Import Maven Projects automatically" [17:31:35] sounded like that keeps the IDEA project in sync with the pom which seems like a good idea [17:31:59] if you hover over stuff, there's usually helpful tooltips unlike a certain open source project I won't mention :) [17:33:02] cool ok [17:35:59] hmmm so if I want to add storm and hadoop dependencies.... [17:36:03] how do I do that? [17:36:21] Oh i see [17:36:23] in the root pom.xml [17:36:25] there are dependencies [17:36:36] and I guess those are in the nexus repo already, right? [17:36:36] or something? [17:36:39] ah no, cloudera has one [17:36:40] ok [17:36:45] no not yet [17:36:47] so I should be able to add a storm repo or something? [17:37:02] i would leave maven aside for now and just work on the code [17:37:13] once it works, mavenization will be clear [17:37:26] kinda like this: [17:37:27] https://github.com/nathanmarz/storm-starter/blob/master/m2-pom.xml [17:37:30] oh? [17:37:39] but I will need to compile with the dependencies when I work on the code, won't I? [17:37:54] ottomata, I'm a bit behind on where you are with the limnification of the datasource you created [17:38:20] yes, but do you know what dependencies you need? [17:38:24] well, i left that off, because there's a problem [17:38:26] with the data [17:39:45] um, storm? [17:39:55] <dependency> [17:39:56] <groupId>storm</groupId> [17:39:56] <artifactId>storm</artifactId> [17:39:56] <version>0.8.1</version> [17:39:56] <scope>provided</scope> [17:39:56] </dependency> [17:39:57] ?
[17:40:44] i would remove scope [17:40:57] oh maybe not, keep it [17:41:03] yes just add it to pom.xml [17:41:59] groupId is probably slightly different [17:42:10] but that's a guess [17:42:47] https://github.com/nathanmarz/storm/wiki/Maven [17:42:53] ok so it does look good [17:42:59] you do need to add the repo: [17:43:00] https://github.com/nathanmarz/storm/wiki/Maven [17:43:06] <repository> [17:43:07] <id>clojars.org</id> [17:43:08] <url>http://clojars.org/repo</url> [17:43:08] </repository> [17:45:14] ottomata ^^ [17:45:17] hmm, ok [17:46:13] ok, i'm going to try intellij with the storm-starter project first, and see if I can get it to run a storm topology locally [17:46:19] ok [17:46:21] if I get that far I think I'll understand what needs to happen for kraken [17:46:24] let me know if i can help you [17:46:50] k danke [17:52:12] ok drdee, milimetric [17:52:15] no idea what i'm doing here [17:52:17] added clojars.org as proxy repo to our nexus repo [17:52:32] what are you trying to do? [17:52:32] i got storm-starter cloned and imported in intellij [17:52:42] it comes with a wordcount topology example [17:52:54] trying to compile it [17:53:05] I *think* I synced sources, or something [17:53:07] it has things like [17:53:08] import backtype.storm.Config; [17:53:18] but i'm getting package does not exist errors [17:53:22] shoot me the repo you imported, I'll try it before our standup [17:53:29] https://github.com/nathanmarz/storm-starter [17:53:45] i just have no idea what i'm doing here [17:53:53] i have no experience with java IDEs and dependencies [17:54:09] so I have this thing all imported [17:54:10] and i'm like [17:54:12] hm, that word count says something about python [17:54:14] ok, now what? :) [17:54:23] yeah storm does multilang [17:54:30] i think it uses a python bolt in the process [17:54:39] but, if that class was compiled [17:54:56] you could run it with java -Dexec.mainClass=storm.starter.WordCountTopology [17:54:58] did you install this leiningen thing? [17:55:04] naw, do I need to?
[17:55:13] i think that's for clojure stuff [17:55:28] Maven is an alternative to Leiningen. [17:55:30] oh sorry [17:55:30] yep [17:55:56] i was able to get this to work last spring when I was first playing with it [17:56:04] but I downloaded the deps manually and put them on my classpath [17:56:07] and compiled, etc. [17:57:53] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [18:22:43] erosen, milimetric: woohoo, all oiled and ready! [18:22:43] http://www.flickr.com/photos/ottomatona/8360847841/in/photostream/ [18:24:18] looks nice! [18:59:49] dschoon, ping me when you get back, I'm in the process of deploying and I'm finishing the feature/d3 branch [19:00:03] but i know you're sick so don't worry unless you're gonna work on it [19:06:15] hye milimetric [19:06:20] what do you think is better for an unknown continent? [19:06:21] - [19:06:22] or [19:06:23] hi [19:06:23] unknown [19:06:25] or something else? [19:06:46] I like Unknown [19:07:10] capital? ok [19:07:16] yeah, so it matches the others [19:07:19] people might want to report on that [19:07:24] ok cool [19:07:30] it'd be nice to see that number go down :) [19:09:21] yeah need to figure out why there are so many [19:13:09] hmmmm [19:13:14] milimetric, the reason why there are unknowns [19:13:21] is beacuse some IPs geocode to continents with no country [19:13:31] $ geoiplookup 195.212.29.166 [19:13:31] GeoIP Country Edition: EU, Europe [19:13:31] GeoIP City Edition, Rev 1: EU, N/A, N/A, N/A, 47.000000, 8.000000, 0, 0 [19:13:53] http://www.maxmind.com/en/geoip_demo [19:17:20] yeah, so this number is definitely useful and hopefully as GeoIP improves we'll see the number go down [19:17:37] until then people could use it to calculate error on the other numbers [19:18:03] well, some of these I can probably get the continent out of the country name [19:18:08] that one there is definitely europe [19:18:13] the UDF just doesn't know what country "EU" is in [19:18:15] sorry [19:18:19] what 
continent "EU" is in [19:18:26] since it is expecting a country code [19:38:39] ungh! [19:38:45] drdee, what do you think I should do [19:38:54] if sometimes the countryCode is actually the continentCode [19:39:26] i would add a check for that, the number of continents is very limited [19:39:32] it's more difficult [19:39:32] um [19:39:36] ohhh [19:39:41] AF is a countryCode and a continentCode [19:39:49] africa, afghanistan [19:39:50] crap [19:40:09] do you know all the exceptions? [19:40:09] i looked in just one log file [19:40:11] use the alpha-3 codes [19:40:18] i only found two [19:40:21] for countries [19:40:31] EU, Europe [19:40:31] A2, Satellite Provider [19:40:41] hmmmm, can I get that from the maxmind db? [19:40:45] countries? [19:40:47] yes [19:40:49] 3 letter [19:41:28] https://gerrit.wikimedia.org/r/gitweb?p=analytics/reportcard/data.git;a=blob;f=geo/country-codes.json;h=5f926143aad32242f593fd3e9b1f295f845a3392;hb=refs/heads/develop [19:41:51] mapping of alpha-2, alpha-3, and a bunch of other codes to names and such [19:42:02] IOC is the olympic codes [19:42:05] FIFA codes [19:42:28] a2 is what maxmind returns by default, iirc [19:42:58] i coded that file in a bunch of forms [19:43:02] https://gerrit.wikimedia.org/r/gitweb?p=analytics/reportcard/data.git;a=tree;f=geo;h=2996adc4076e4c821f7ef0f21e3b3e99f0b5bb84;hb=refs/heads/develop [19:43:47] siigh, why are things always harder than they should be! [19:45:36] can't we just hardcode the exceptions? [19:46:53] i think the answer is that there ARE no country codes [19:46:58] er [19:46:59] heh [19:47:03] no CONTINENT codes [19:47:16] (i'm sick.
give me a break) [19:47:30] drdee, i'll check for exceptions in a day's worth of all.100 data [19:47:36] if I only see those two, i'll just do that [19:47:58] sounds good to me [20:11:10] ergh, drdee, there are more [20:11:25] geoiplookup 206.53.148.209 [20:11:25] GeoIP Country Edition: AP, Asia/Pacific Region [20:11:42] lots of 'anonymous proxy' [20:11:51] welp [20:11:54] i think though [20:11:56] the whole list is http://www.maxmind.com/en/iso3166 [20:12:04] there are things that the db returns that are not on that list [20:12:14] really? [20:12:16] yes [20:12:25] i mean, there are things in that list that are not in 3166 [20:12:29] $ geoiplookup 65.49.68.181 [20:12:29] GeoIP Country Edition: A1, Anonymous Proxy [20:12:39] that's... #1 [20:12:40] on the list [20:12:41] A1 is not a country [20:12:43] that i just linked [20:12:45] oh maxmind [20:12:50] sorry i was looking at the wikipedia page [20:12:54] hmmmmmmmm ok [20:12:54] the list there has a bunch of stuff that's not in 3166 [20:12:56] I see [20:12:57] as i said [20:13:01] O1 [20:13:04] "Other country" [20:13:05] aye aye i get it [20:13:06] Helpful! [20:13:24] AP,"Asia/Pacific Region" [20:20:24] ok cool, thanks dschoon, that was very helpful [20:20:26] there are 5 exceptions [20:20:28] I can hardcode them [20:20:31] awesome [20:21:31] but! [20:21:32] question [20:21:40] is asia/pacific in asia? or oceania? [20:24:08] what is the ip address that returns asia/pacific?
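The five pseudo-country codes turned up above can be hardcoded. The continent assignments here are judgment calls (MaxMind defines the codes, not their continents), and mapping AP to Asia rather than Oceania is exactly the open question:

```python
# MaxMind codes that are not real ISO 3166 countries, mapped by hand.
CONTINENT_EXCEPTIONS = {
    "EU": "Europe",     # geoiplookup: "EU, Europe" (no country)
    "AP": "Asia",       # "Asia/Pacific Region" -- could argue Oceania
    "A1": "Unknown",    # Anonymous Proxy
    "A2": "Unknown",    # Satellite Provider
    "O1": "Unknown",    # Other country
}

def continent_for(code: str, country_to_continent: dict) -> str:
    """Resolve a MaxMind country code to a continent. Real country
    codes win first, so e.g. 'AF' stays Afghanistan/Asia: MaxMind
    returns country codes, and only the entries above are
    continent-like leftovers."""
    return country_to_continent.get(code) or CONTINENT_EXCEPTIONS.get(code, "Unknown")

print(continent_for("EU", {"DE": "Europe", "AF": "Asia"}))  # Europe
print(continent_for("A1", {}))                              # Unknown
```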
[20:24:38] here's one [20:24:39] 206.53.152.167 [20:25:29] that is in china, according to google maps [20:25:48] actually [20:25:49] can't really tell [20:25:58] that's just according to the coordinates that maxmind gives back for those IPs [20:26:00] those are in china [20:26:03] so I guess I'll pick asia [20:26:09] china, http://whatismyipaddress.com/ip/206.53.152.167 [20:49:27] I heard on a different IRC channel that in China, if you have a Linux laptop, interwebz would disconnect every 5 minutes [20:49:43] and they're forced to use Windows [20:50:21] I guess that would be a good criterion to check for the small percentage of Linux in China [20:50:37] or vetting as ez calls it [20:54:54] might not be true.. [20:55:13] hey drdee, [20:55:22] where do user-generated warnings in pig go? [20:55:23] warn("getLocation() returned null on input: " + ip, PigWarning.UDF_WARNING_1); [20:55:28] yooooo [20:56:06] one of the many hadoop log files i assume :D [20:56:14] ok looking [21:30:47] drdee, since pulling kraken repo on mvn package: [21:30:47] [ERROR] /home/otto/kraken/src/main/java/org/wikimedia/analytics/kraken/pig/isValidIPv4Address.java:[25,7] class IsValidIPv4Address is public, should be declared in a file named IsValidIPv4Address.java [21:31:02] [ERROR] /home/otto/kraken/src/main/java/org/wikimedia/analytics/kraken/pig/isValidIPv6Address.java:[26,7] class IsValidIPv6Address is public, should be declared in a file named IsValidIPv6Address.java [21:31:20] i thought i had changed those filenames [21:32:03] nope i didn't, 1 sec [21:33:38] very weird [21:33:53] i renamed the files but git does not see it [21:33:56] so i can't commit [21:33:57] oh boy [21:34:01] that's because you are on your mac [21:34:01] or push [21:34:06] and mac is case insensitive [21:34:14] I can probably do it on an01 [21:34:25] but our local repos are going to be unhappy when we pull [21:34:27] will have to reclone [21:34:46] or maybe you already did this, i dunno [21:34:48] and a reclone will help?
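The case-only rename that git couldn't see has a standard workaround: rename through an intermediate name that differs by more than case. A sketch using the filenames from the mvn error above:

```python
import os
import tempfile

def case_safe_rename(src: str, dst: str) -> None:
    """Rename src to dst via a temporary name, so the change is
    visible even on a case-insensitive filesystem (the OS X default),
    where a direct case-only rename can look like a no-op to git."""
    tmp = dst + ".rename-tmp"   # C: differs by more than case
    os.rename(src, tmp)         # A -> C
    os.rename(tmp, dst)         # C -> B

d = tempfile.mkdtemp()
open(os.path.join(d, "isValidIPv4Address.java"), "w").close()
case_safe_rename(os.path.join(d, "isValidIPv4Address.java"),
                 os.path.join(d, "IsValidIPv4Address.java"))
print(os.listdir(d))  # ['IsValidIPv4Address.java']
```

With git the same trick is two moves: `git mv isValidIPv4Address.java tmp && git mv tmp IsValidIPv4Address.java`.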
[21:35:00] mac os not case insensitive with filenames, is it? [21:35:08] yup [21:35:20] it's really bad [21:35:31] the default FS is installed insensitive [21:35:33] you can install sensitive [21:35:35] I did that once [21:35:39] but it messed a lot of things up [21:36:45] hold on [21:37:14] ah [21:37:16] git pull worked anyway [21:37:18] i got it to push [21:37:19] pull now [21:39:15] arrghh [21:39:17] i was fixing it [21:39:44] now it is really broken [21:39:51] the solution was to do a temp rename to another file [21:39:59] instead of A -> B [21:40:05] do A -> C -> B [21:43:20] ok pulled [21:43:23] thanks! [21:43:34] was not aware of the case stuff on osx at all [21:50:38] drdee: about case (i)sensitivity, when you install OSX you get to choose whether you want case insensitive/sensitive filenames [21:50:55] that was the case for me when I installed iATKOS (fork of OSX) on my vm [21:50:58] did not know that at all [21:51:02] thanks [21:54:36] yeah but don't change it! [21:54:45] it will bite you later when you are least expecting it [21:54:53] you'll try to install or run some app, and it will be all weirded out [21:55:10] OS X coders don't bother with being consistent with their path names, I guess [21:55:15] often using Upper case, other times not [22:07:30] heyaa erosen [22:07:34] ok! trying to use limnify [22:08:58] great [22:09:00] i'm in a meeting [22:09:03] so only have half attention [22:09:10] limnify: error: argument --datecol: invalid int value: 'Hour' [22:09:16] limnify --delim='\t' --datefmt="%Y-%m-%d_%H" --datecol Hour ~/pig/krakensrc/c/hour_continent_mobile.tsv [22:09:25] otto@analytics1001:~/scr$ head ~/pig/krakensrc/c/hour_continent_mobile.tsv [22:09:26] Hour Continent Count [22:09:26] 2013-01-01_00 Asia 535984 [22:09:26] 2013-01-01_00 Africa 20536 [22:09:58] did you pass in the columns [22:11:30] limnify: error: argument --datecol: invalid int value: 'Hour' [22:11:34] what am I supposed to put?
[22:12:40] oh [22:12:41] sorry [22:12:45] -h gives more info [22:12:48] i was just reading the error usage [22:14:57] hmm, the default columns should work [22:15:35] hmm [22:15:42] sorry i can't quite multitask on this now [22:15:47] s'ok, thanks [22:15:48] can I get back in 45? [22:15:55] i'm out in 15 [22:15:56] hmm [22:15:59] but you can try it on an01 if you want [22:16:01] file is [22:16:08] /home/otto/pig/krakensrc/c/hour_continent_mobile.tsv [22:17:04] great [22:17:06] i was going to ask that [22:25:48] alright, thanks for the help, i'm outtaaaa. man that took way too much time today, but continents are better now [22:25:49] laters boys! [22:32:48] laterz!!!! [22:38:25] drdee: will show you some neat stuff when you come back :) [22:38:37] oh you were just replying to Andrew :) [22:38:42] ok, almost ready [23:08:04] average_drifter: show me! [23:11:55] drdee: not ready yet, but I hope very soon [23:12:27] aight [23:51:40] omg that took so much effort and time. I've been through hell and back for the last 10 hours [23:51:40] http://test-reportcard.wmflabs.org/ [23:51:45] it's up. minified [23:51:56] I'm defeated, going to go curl up in a corner and die [23:53:05] or cry. Yeah, that's less dramatic :)