[06:39:03] jeremyb: I didn't even know this channel was still active [06:40:00] Nemo_bis: errrr, look at the 21st? [06:41:11] (of november) [06:41:17] * jeremyb sleeps [12:44:17] gooooood morning! [12:44:18] :) [13:45:57] morning! [13:47:27] hey ottomata ! [13:47:37] ottomata: greetings [13:48:41] hihi [14:30:24] morning milimetric, ottomata1 [14:30:37] how was the thanksgiving weekend? [14:30:37] good morning drdee [14:30:45] who's ottomata1 [14:30:46] ba! [14:30:59] pretty awesome, Cornell killed UofM in hockey [14:31:06] : [14:31:07] :D [14:31:14] yo ottomata [14:31:18] ohhhh [14:31:19] yoyo [14:31:22] you are yourself again [14:31:23] man, tg was so great [14:31:24] i am [14:32:08] good to hear, how many turkeys were killed? [14:32:47] average_1rifter morning as well! [14:32:50] i killed no turkeys personally [14:32:56] and I do not have data on the national statistic [14:33:06] can't we count that using kraken? [14:33:11] however, over the last week, I probably consumed turkey meat originating from 3 turkeys [14:33:34] ha, so you are co-responsible for 3 dead turkeys :D [14:33:36] I also played the most interesting game of Risk ever, and won! [14:33:41] I suppose so [14:33:47] how many hours did it last? [14:33:52] Trader Joes gave me a free one, because the first one was frozen [14:33:57] hm, 4? [14:34:01] yeah, probably 4 hours [14:34:04] not bad i think [14:34:18] I won the family Risk trophy for the year [14:34:24] i've played games that would last 8-10 hours [14:34:26] nice! [14:34:49] yeah [14:34:52] everybody just building up armies, and nobody wants to attack becomes it makes you vulnerable [14:35:20] where'd you start from and what was the strat? [14:35:47] I always love conquering out of Australia [14:36:14] hey about the new webstatscollector, can you setup a rsync to copy the files to dataset2 and with url dumps.wikimedia.org/other/pageviews-new/ ? [14:36:36] then i will send out an email announcing this 2nd version and asking people to help validate the data [14:36:49] ask for feedback on how to define the page view metric [14:37:01] and tell them what neat features you guys introduced in this version [14:37:22] then, hopefully, sometime january we can turn off the old version of webstatscollector [14:37:36] drdee: morning ! :) [14:37:41] yoooooooO! [14:37:48] :) [14:37:53] what's with the 1 in your handle? [14:37:57] yup, australia [14:38:06] drdee: irssi puts it if I have 2 instances open [14:38:16] one irssi instance died now, so it's just me now [14:39:44] k [14:41:12] drde, maybe you want pagecounts-new/ since the others are pagecounts-ez and pagecounts-raw [14:41:13] ? [14:42:09] also, do I need to do a regular rsync for this? right now collector is just kinda hacky and running in a screen on an26 [14:42:17] can I just rsync what is currently there? [14:42:25] drdee^ [14:42:52] yes you are right about the url [14:43:07] and yes this should be regular rsync [14:43:31] i mean, does it have to be a job? people can compare with the data that has been collected, right? [14:44:31] i'd rather script this after it is set up for good [14:45:05] i don't really care about the exact implementation as long as we regurarly coy the data [14:45:41] but whyyyyy do we need regularly right now? we are just asking people to verify that the data is correct, right? [14:45:53] they just need a few days to compare this method vs. 
the old [14:49:38] folks like tilman want the data every day for their blog pageviews [14:49:58] and i don't exactly know when we will be able to make the switch permanent [14:54:54] avrage_1rifter: what's the status regarding the wikistats bug? [14:59:07] hm, ok hm [14:59:08] ok but [14:59:18] we should compare for ourselves first, eh? [14:59:23] looks to me like there are some differences [14:59:29] http://dumps.wikimedia.org/other/pagecounts-new/2012/2012-11/ [14:59:33] http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-11/ [15:00:00] yes i've asked erik z to look into this, and me and stefan will also do some more checking today [15:00:13] i think the new stuff counts a lot more domains than the old [15:00:45] i think you are right, i think it's counting a bit too much :) [15:00:52] oh ? [15:00:59] i will work on it with stefan today [15:01:23] well, actually, i dunno [15:01:31] there are fewer lines in the one project count file i'm looking at [15:02:09] hm ,yeah, i dunno [15:03:09] I am still looking at the bug in wikistats [15:03:15] for starters, i am not sure if 'blog' is accurate, it seems to be the wikimediafoundation wiki site, not the blog site [15:03:43] drdee: can you give me a link please so we can look at the same thing ? [15:03:55] drdee: I mean which of the .gz files are you lookin at ? [15:05:08] mornin [15:06:37] http://dumps.wikimedia.org/other/pagecounts-new/2012/2012-11/ [15:06:40] and then the first file [15:17:45] * average_1rifter brb 20m [15:48:56] yargh, i keep getting disconnected! [16:04:09] relocating, brb in 20min [16:09:21] back [17:58:53] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:58:57] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [18:49:39] ottomata: hey [18:49:50] ottomata: can you help me reproduce a problem please ? [18:50:17] ottomata: we need to do a second run of collector and udp-filter but this time having squid logs recorded somewhere so I can reproduc some problems I'm having [18:58:39] hiyea [18:58:46] eating lunch, but real quick [18:58:50] what we should do [18:59:00] is take a days worth of sampled squid files [18:59:23] and write a script to output that file in approximately real time [18:59:29] that way you have a reproduceable source [18:59:33] and compare both versions [19:03:48] average_1rifter ^ [19:23:24] ottomata: yes [19:23:29] ottomata: how can we do that please ? [19:28:28] dunno, write a script? :p [19:28:33] i can give you a file [19:29:16] you could make it output any lines that match the current minute, or something [19:29:26] since they are daily rotated files [19:29:59] ottomata: yes please [19:30:26] easiest hacky thing: while true; do grep $current_minute $file; done [19:30:35] but you could make it smarter for sure [19:30:37] ok [19:30:49] you have access to stat1, right? [19:31:15] sampled log files are at /a/squid/archive/sampled [19:33:04] changing locations, be back on in a bit [20:03:17] average_1rifter, how else can I help? [20:06:25] ottomata: I'll go on build1 and try out some things with the squid logs. Will run the old and new udp-filters there [20:06:29] and the collector [20:06:57] ottomata: I'm currently trying to set up the udp-filters build on jenkins [20:07:16] ottomata: remember when we talked that we should make it so that jenkins builds udp-filters with libanon and libcidr which were also built on jenkins ? 
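A sketch of the log-replay idea discussed above, before the Jenkins thread continues: take one day's sampled squid file from /a/squid/archive/sampled on stat1 and re-emit its lines in approximately real time, so the old and new udp-filter/collector pipelines can be fed an identical, reproducible stream. ottomata's quick version is "while true; do grep $current_minute $file; done"; the variant below only assumes that each log line carries an HH:MM timestamp that plain grep can match, and the filename is a placeholder.

    #!/bin/bash
    # Replay one day's sampled squid log in approximate real time.
    # Sketch only: assumes each line contains an HH:MM timestamp that grep can
    # match directly; adjust the pattern to the real field layout of the files.

    LOGFILE="${1:-/a/squid/archive/sampled/sampled-1000.log-20121126}"   # hypothetical filename

    last_minute=""
    while true; do
        minute=$(date -u +%H:%M)
        if [ "$minute" != "$last_minute" ]; then
            # Emit the lines stamped with the current minute, once per minute.
            grep "$minute" "$LOGFILE"
            last_minute="$minute"
        fi
        sleep 5
    done

Piped into both the old and the new udp-filter (and on into each collector), this gives the reproducible side-by-side comparison average_1rifter is asking for.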
[20:07:19] ottomata: well [20:07:35] ottomata: export LIBRARY_PATH=`pwd`/../libanon/src/.libs/:`pwd`/../libcidr/src/; export C_INCLUDE_PATH=`pwd`/../libanon/src/:`pwd`/../libcidr/include/; export LD_LIBRARY_PATH=`pwd`/../libanon/src/.libs/:`pwd`/../libcidr/src/; [20:08:08] ottomata: this should work (it works on my machine without libanon0 libanon0-dev libcidr0 libcidr0-dev packages) [20:08:14] ottomata: on jenkins, but it doesn't work [20:08:31] ottomata: can you please help me to find out why this doesn't work on jenkins ? [20:08:40] ottomata: could I have temporary access to jenkins ? [20:08:49] average_1rifter: better bet is to ask hashar [20:09:02] he is the jenkins guy [20:09:04] what's the error? [20:09:13] yeah :) [20:09:29] average_1rifter: where is the console output ? :-] [20:10:19] hashar: https://integration.mediawiki.org/ci/job/udp-filter/19/console [20:10:34] hashar: hey [20:10:36] hashar: 19:55:44 src/udp-filter.c:45:21: fatal error: libcidr.h: No such file or directory [20:10:40] this is the error [20:10:53] hashar: but above I wrote to ottomata that I added a shell command specifically for gcc to find libcidr.h [20:12:19] hashar, I can maybe figure this out if I can try to compile via shell on gallium [20:12:26] do you know where the git repo is cloned there? [20:12:43] ottomata: moment, I have an answer on that I think [20:13:09] sorry been multitasking a biit too much [20:13:20] /var/lib/jenkins/jobs/libanon/workspace/src [20:13:23] /var/lib/jenkins/jobs/libcidr/workspace/src [20:13:25] I think that's the problem [20:13:35] I need one more ".." in the relative paths I wrote above [20:13:41] I think, I'm not sure [20:13:41] ottomata: I keep losing the link to the authoritative zero partner ip ranges [20:13:45] can you pass that on again? 
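The export line just quoted is the crux of the Jenkins problem: udp-filter has to find the headers and shared objects that the separate libanon and libcidr jobs already built. average_1rifter later reports fixing it by adjusting the paths; the sketch below is an equivalent "execute shell" build step that sidesteps the relative-path guesswork by using the absolute workspace paths mentioned above. The configure/make commands for udp-filter are assumptions, not taken from the log.

    #!/bin/bash
    # udp-filter job build step: compile against the artifacts left behind by
    # the libanon and libcidr Jenkins jobs (separate-jobs setup).
    # Assumes the /var/lib/jenkins/jobs/<job>/workspace layout quoted above
    # and a standard autotools build for udp-filter.
    set -e

    LIBANON_WS=/var/lib/jenkins/jobs/libanon/workspace
    LIBCIDR_WS=/var/lib/jenkins/jobs/libcidr/workspace

    # Headers for the compiler, libraries for the linker, .so lookup for the tests.
    export C_INCLUDE_PATH="$LIBANON_WS/src:$LIBCIDR_WS/include"
    export LIBRARY_PATH="$LIBANON_WS/src/.libs:$LIBCIDR_WS/src"
    export LD_LIBRARY_PATH="$LIBRARY_PATH"

    ./configure && make && make check

As hashar points out below, this couples the udp-filter build to whatever state those other workspaces happen to be in, which is what the rest of the discussion tries to work around.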
[20:13:55] average_1rifter: or clone both lib in udpfilter workspace [20:14:12] we have ton of disk space so duplicating clones is not much of an issue [20:14:33] erosen, ottomata, done [20:14:45] erosen: https://office.wikimedia.org/wiki/Partner_IP_Ranges [20:14:56] danke [20:15:22] hashar: we can do that too but, if I clone all of them in the same place then I'll need to have just one job for all [20:15:32] hashar: whereas I already have libanon and libcidr builds working properly [20:15:42] hashar: I just need to make udp-filters use the stuff in those dirs [20:15:55] hashar: like using libcidr.h and libanon.h and libcidr.so and libanon.so [20:16:27] in my experience you are better with one job per task [20:16:32] aka one job for each library [20:16:40] and a job for udpfilter that clone the required libs [20:16:44] from master for example [20:17:09] yeah, but that would mean that the udp-filter job would recompile the libcidr/libanon stuff [20:17:10] hashar: I will use your advice :) [20:17:25] whenever you build udpfilter, it will rely on the libs workspaces which might be in some random state [20:17:29] i think better would be to make udp-filter install the -dev packages [20:17:41] for libanon and libcidr [20:17:43] for example, libanon workspace could have a broken change applied to it :-] [20:17:49] since udp-filter assumes that those are available for building [20:18:12] or even, install the -dev packages inside the udp-filter build dir [20:18:14] ottomata: building udp-filters does not necessarily need any packages, just shared objects and headers [20:18:19] ottomata: the udp-filter would recompile if the lib changed I guess [20:18:30] ottomata: I think if we will install packages every time, then that would entail having root rights [20:18:32] dpkg-deb -x libanon-dev ./ [20:18:34] or something [20:18:41] ottomata: and I am not sure if jenkins has sudo [20:18:53] no sudo for jenkins :-] [20:18:54] that's true, but you could unpack the .debs locally [20:19:03] probably better not to have udp-filter reach outside of itself into some other build project [20:19:19] why don't you just rebuild the libs? Do they take sooo long to compile ? [20:19:23] yes, that is true, but all files needed are available without making the packages [20:19:43] not really, but that makes udp-filter builds depend on libanon/libcidr builds, rather than libanon and libcidr packages [20:20:28] so, you'd need to recompile libcidr every time you compile udp-filter, [20:20:34] but if you just depended on the package [20:20:46] then the lib files are already created and NOT as a part of your build process [20:20:47] right? [20:21:35] ottomata: currently libs for libcidr and libanon are built without creating the packages themselves libanon0 libcidr0 libanon0-dev libcidr0-dev [20:21:53] I am not sure you need to recompile libcidr everytime [20:22:05] libcidr does not need to be compiled every time [20:22:26] I mean.. it's quite stable [20:22:32] in the udpfilter job you can clone libcidr, build it. If it does not get updated make is probably not going to recompile it unless you want to run make clean [20:22:35] and [20:22:46] if it is fast to compile, you can just recompile everything :-] [20:22:58] though that is a bit a waste of power and cpu cycle [20:23:06] I will try both solutions [20:23:20] hashar: you are pro building everything in one job right without making packages , right ? 
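The "unpack the .debs locally" route ottomata floats here needs no sudo, since dpkg-deb -x just extracts a package's file tree into a directory. A sketch, assuming the runtime and -dev packages have already been copied into the workspace (their names below are hypothetical, and the log doesn't say where built packages would be published):

    #!/bin/bash
    # Package-based variant: extract prebuilt libanon/libcidr packages into a
    # local prefix inside the udp-filter workspace, then build against that.
    set -e

    PREFIX="$PWD/deps"
    mkdir -p "$PREFIX"

    for deb in libanon0_*.deb libanon-dev_*.deb libcidr0_*.deb libcidr-dev_*.deb; do
        dpkg-deb -x "$deb" "$PREFIX"     # no root needed, unlike dpkg -i
    done

    # A normal Debian layout puts headers under usr/include and libs under usr/lib.
    export C_INCLUDE_PATH="$PREFIX/usr/include"
    export LIBRARY_PATH="$PREFIX/usr/lib"
    export LD_LIBRARY_PATH="$PREFIX/usr/lib"

    ./configure && make && make check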
[20:23:38] ottomata: you are pro building packages and extracting the files from packages and then using them to build udp-fitlers right ? [20:24:11] ahhh, whatever hashar says is probably best [20:24:12] ottomata: addendum , you also suggest having separate jobs right ? [20:24:15] i'm making this up as I go :) [20:24:20] i do think separate jobs is better [20:24:21] :) alright [20:24:43] screw the packages :-] [20:24:50] it is just adding an extra layer ;] [20:25:12] average_1rifter: are the libanon / libcidr working ? [20:25:22] hashar: the builds are working ok yes for those [20:25:25] nice [20:25:29] so keep them independant [20:25:39] and forget about them unless theses jobs complains [20:26:10] if you don't really care of versions dependencies, udpfilter should just clone the lib it requires [20:26:20] in some subdirectory in the udpfilter workspace [20:26:27] like $WORKSPACE/libs/ [20:26:41] then compile them [20:27:08] then do the magic gcc command that will make udpfilter to look for the compiled libanon / libcidr in $WORKSPACE/libs/ [20:27:32] that also mean that udpfilter could break because one of the lib introduce a change (since libs are always compiled from master) [20:27:48] but you could well build udpfilter using a lib at some given tag [20:28:01] hashar: I could make this work as you say and I am going to follow your advice [20:28:11] hashar: but ottomata raised an interesting point about packages and dependencies [20:28:24] hashar: is jenkins in any way Debian-aware ? [20:28:48] hashar: meaning => can Jenkins be geared towards building .debs and installing them and that sort of thing ? [20:29:02] Jenkins is just a job controler [20:29:09] a job could contains like anything you want :-] [20:29:22] ok, that was my understanding also [20:29:28] including triggering a nuclear weapon, preparing coffee or whatever :-] [20:29:38] :) [20:29:38] you could definitely build packages [20:30:09] one way would be that whenever libanon compile + test is a success it could trigger a job that would build the deb package [20:30:13] that's really what we are going for here, right? [20:30:15] using some of the magic commands from debian [20:30:21] we can't really use any of these things unless the .debs are created [20:30:30] and, these projects shoudl ahve debian/ dirs available already [20:30:38] there is a bug somewhere asking for Jenkins to build packages [20:32:18] average_1rifter: apparently the Jenkins git plugin let you use multiple source control management [20:34:54] hashar: oh that's cool, I'm trying it out [20:35:26] ottomata: there's two ways of looking at it [20:35:39] ottomata: IMHO [20:35:50] ottomata: one way is looking at jenkins as a CI tool that builds and runs tests [20:36:06] ottomata: and the additional one is to look at it as also a way of making debs automatically [20:36:30] should we strive for both, or just the first ? [20:36:56] do the first :-] [20:37:03] once it works, start building debs :-D [20:37:11] +1 do the first [20:37:12] i guess so, but isn't a working .deb part of the whole build process? 
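hashar's single-job alternative, raised just above, is to clone the libraries into the udp-filter workspace and build them there; it would look roughly like the following. The clone URL is a placeholder, the libraries' own build commands are assumed to be plain configure/make (as the earlier export line implies), and as he notes the checkouts track master unless a tag is pinned.

    #!/bin/bash
    # Single-job variant: build libanon and libcidr inside $WORKSPACE/libs,
    # then build udp-filter against those fresh copies.
    set -e

    LIBS="$WORKSPACE/libs"
    mkdir -p "$LIBS"

    for repo in libanon libcidr; do
        [ -d "$LIBS/$repo" ] || git clone "https://gerrit.example.org/$repo" "$LIBS/$repo"   # placeholder URL
        (cd "$LIBS/$repo" && git pull && ./configure && make)   # build commands assumed; pin a tag here if wanted
    done

    export C_INCLUDE_PATH="$LIBS/libanon/src:$LIBS/libcidr/include"
    export LIBRARY_PATH="$LIBS/libanon/src/.libs:$LIBS/libcidr/src"
    export LD_LIBRARY_PATH="$LIBRARY_PATH"

    ./configure && make && make check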
[20:37:18] we want to test that a .deb is produced properly [20:37:38] and the .deb are probably going to be hosted in git under operations/debs/*** [20:37:43] so that might be another job [20:38:11] triggered whenever the lib has been successfully compiled [20:38:20] operations/debs is only for packages where we don't modify anything upstream, but only create the debian/ dir [20:38:29] oh [20:38:33] afaik [20:38:35] in the grand scheme of things yes, but let's make our lives not too difficult i really want to get webstatscollector running smoothly, sweet packaging is secondary priority [20:38:43] lots of our own packages have debian/ dirs that come with the source [20:38:50] so the debian/ dirs are not in operations/debs [20:38:58] ottomata: that makes sens [20:39:04] so you need a different workflow [20:39:17] anyway I definitely recommend to have the build + test jobs to run [20:39:17] drdee, ha, i guess so, but it won't be deployed unless there are .debs [20:39:33] true, but we can build the debs manually on buil1 [20:39:35] a .deb building job could be compleetely spearate [20:39:38] then figure out later when,where, how to build the .deb [20:39:53] i agree though, this should work before you try to build the debs :p [20:40:20] you could have a job that compile some unstable debs that will be build whenever the libs have been compiled/tested successfuly [20:40:43] and another job that will only kick in whenever a tag is pushed, thus providing stable packages [20:42:19] yeah that'd be cool [20:48:05] ok, solved libanon and libcidr [20:48:11] paths were a bit more complicated [20:48:13] but solved [20:48:14] nooow [20:48:15] src/udp-filter.c:47:19: fatal error: GeoIP.h: No such file or directory [20:48:43] libgeoip-dev libgeoip1 need to be installed on jenkins [20:49:06] I have to remember where was the gerrit ottomata made for openssl [20:49:15] looking [20:51:13] https://integration.mediawiki.org/ci/job/udp-filter/24/consoleText [20:53:38] that openssl wasn't actually an openssl problem, it was a pkg-config, libpcap-dev problem [20:53:41] but I can add geoip [20:53:59] ottomata: thank you [20:54:00] :) [20:54:09] that would be awesome, it's the last step [20:58:39] I am off for conf calls [20:58:47] average_1rifter: feel free to mail me at hashar at free dot fr [20:58:56] average_1rifter: though I have a veryyy busy week [20:59:01] will try my best to answer / give tips :-) [20:59:25] hashar: thanks :) I will e-mail [20:59:40] I'll report back about the build process of udp-filters on jenkins [21:03:50] average_1rifter, geoip is installed [21:05:22] YAY! https://integration.mediawiki.org/ci/job/udp-filter/25/ : SUCCESS [21:11:02] great success [21:11:07] yes ! [21:11:10] build is succesful [21:11:14] and tests are now passing [21:11:29] ottomata: thanks :) [21:11:40] yup! [22:04:03] hey analytics, quick one – is there some obvious place where I could get raw daily pageview data for enwiki for the last two years? Other than crunching the Domas dumps myself, that is [22:06:42] btw http://dumps.wikimedia.org/other/pagecounts-raw/ is missing a link to the 2012 folder [22:07:01] hmm, 2012 exists [22:07:06] do you know who maintains those index files? [22:07:09] or who puts those files there? [22:07:41] Domas wrote the script that generates the dumps [22:07:45] right [22:07:48] you want daily rather than the hourly ones that are there now? [22:07:52] I guess ariel maintains it? 
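On the packaging thread above: once the libraries and udp-filter compile and test cleanly, the separate ".deb building" job (or a manual run on build1) only needs the stock Debian tooling, since the debian/ directories ship with the source. A minimal sketch, assuming the build dependencies are already installed on the host:

    #!/bin/bash
    # Separate packaging step: build unsigned binary packages from a source
    # tree that carries its own debian/ directory.
    set -e

    cd "$WORKSPACE"                  # or the checkout on build1
    dpkg-buildpackage -us -uc -b     # -us/-uc: skip signing, -b: binary packages only

    ls -l ../*.deb                   # dpkg-buildpackage drops the .debs in the parent dir

As hashar suggests, this step could be triggered by a successful compile-and-test build for unstable packages, with a second job keyed to pushed tags for stable ones.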
[22:08:21] there is this: [22:08:21] daily and enwiki only, since at least August 2011 [22:08:21] http://dumps.wikimedia.org/other/pagecounts-ez/ [22:08:45] doesn't have daily, but monthly: [22:08:45] http://dumps.wikimedia.org/other/pagecounts-ez/monthly/ [22:08:54] ah, but that isn't just enwiki [22:08:59] I thikn you ahve to crunch it [22:09:19] darter we maintain it :) [22:09:20] yeah, I was trying to figure out if you guys had this data handy for any existing dashboard [22:09:23] but! if you want to try your hand at Pig, I am happy to load the hourly ones into hadoop for you :) [22:09:35] yo drdee [22:09:41] whaazzzup [22:09:50] what exactly do you need? [22:09:56] ottomata: happy to give it a try [22:10:03] > raw daily pageview data for enwiki for the last two years? [22:10:15] (at least since Aug 2011, ideally older) [22:10:45] yeah pig is probably the fastest [22:10:55] we have domas data since 07 iIRC [22:11:24] context: http://toolserver.org/~dartar/reg2/#g3 [22:11:53] DarTar, do you want the projectcount or the pagecount files? [22:11:55] traffic is the main driver of these variations [22:12:10] i think maybe you just want project counts, unless you care about specific URLs [22:12:12] projectcount [22:12:12] i guess the project count [22:12:13] oh but wait [22:12:27] yeah, project count for sure [22:12:29] project count will let you grep out a line like: [22:12:29] en - 10443851 279587739618 [22:12:34] per hour [22:12:45] (dunno what those numbers are, but they should work :p ) [22:12:52] yeah that sounds like the fastest way to get it [22:13:02] page views and total traffic size [22:14:21] drdee: how's the Nigerian princess doing? [22:15:54] she is doing awesome, she secured my retirement fund :D [22:16:31] high five [22:17:34] it's illegal though, and morally unacceptable, to start paid work before the age of 3 [22:18:00] it's not work, it's charity [22:18:37] good to hear you guys did your homework [22:23:20] DarTar, we didn't start keeping projectcount files until 2010-05 [22:23:26] so that's as far back as you'll get right now [22:23:29] s'ok? [22:24:11] yeah, even just from Aug 2011 on would be a great start [22:25:06] okeyyyyy dokey [22:25:11] have you used hue or hadoop yet? [22:25:44] no, I'll need some help to get started [22:26:21] ok cool, i need to give you an account, one sec [22:27:17] this traffic data stuff is actually not an urgent request, no worries if you can't do this now [22:27:20] naw, i'm on it [22:27:25] k cool [22:27:31] getting ready to head out ofr the day, so this'll be my last little diddy [22:28:18] DarTar: "having it handy" would be https://bugzilla.wikimedia.org/show_bug.cgi?id=42259 I guess? [22:28:58] Nemo_bis: that's a request for a public API afaik [22:29:03] which would be awesome [22:29:51] I know many researchers who would bribe Analytics to work on that [22:30:50] Nemo_bis: not clear if that request is per article or per project [22:31:14] ah cool, yeah that is probably a ways down the road though [22:31:27] re queryable db [22:31:27] sounds to me like they are asking per article [22:34:21] created a card so I don't forget: https://trello.com/c/uFa73tTQ [22:36:15] ok, DarTar [22:36:24] step 1. 
go here and follow the instructions: [22:36:36] analytics1001.wikimedia.org [22:36:50] (specifically, add that line to your /etc/hosts file) [22:37:40] lemme know when you've done that, DarTar [22:37:55] cool, 1 sec [22:39:23] k I'm in [22:39:32] ok, cool [22:39:33] try to go here: [22:39:34] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/projectcounts-raw [22:39:58] login needed [22:41:34] are you in hue? [22:41:41] logged in? [22:42:11] DarTar: do you have a hue account? [22:42:27] ottomata is setting me up [22:42:38] awesome [22:43:14] drdee, have you ever had people change their own pw? [22:43:28] no [22:43:35] sorry [22:43:43] i mean yes [22:44:06] DarTar can't access his profile cause he doesn't have perms [22:44:09] not sure what perms to give him [22:44:18] anyway, DarTar, let's figure that out later, I gotta run soon [22:44:20] so [22:44:23] sure [22:44:24] since you are logged in [22:44:26] you should be able to see [22:44:40] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/projectcounts-raw?file_filter=any [22:44:48] i'll take carefo that [22:44:50] hold on [22:44:56] yes [22:45:30] AH poop [22:45:38] the way I loaded those files won't work [22:45:49] yeah I realized that ;) [22:45:52] hmm, see, hadoop likes big files, not small ones [22:46:04] filename holds timestamp [22:46:09] lots of small files means hadoop jobs make lots of mappers and take longer than they should [22:46:12] grrr [22:46:33] hm [22:46:56] yeah we should merge those zero files [22:47:09] the zero files are fine [22:47:14] those are 1.5 GB files [22:47:18] and pig can get the timestamp from filename [22:47:22] i'm using kafka to consume them all [22:47:25] we're talking about projectcounts [22:47:36] not sure what to do about projectcount [22:47:48] hmmmmm [22:48:00] each lines needs the timestamp from the file name on it [22:49:08] pig can get the timestamp from filename [22:49:16] i don't want to load them by filename [22:49:18] the files are too small [22:49:19] 20K each [22:49:27] I can prepend timestamp... [22:49:57] btw given the small size of the yearly dumps we could just publish them as is (once the ts issue is addressed) [22:50:09] < 200 Mb each [22:50:42] ok, i'll load them all into one file, and prepend timestamp from the existing file name... [22:51:30] but this job might actually be done faster using shell scripting [22:51:42] as there is very little counting involved [22:51:50] and the dataset is so small [22:51:59] definitely [22:52:33] yeah, maybe so [22:52:46] well shooters! i thought this would be a fun one [22:52:51] DarTar, should I just let you have at the numbers then? [22:52:57] however, it would be less fun! [22:52:57] w/o hadoop? [22:53:32] either way is fine, if that's data you guys are planning to import anyway I am happy to do it in hadoop [22:53:47] why flatten that nail with a hammer when you can use a rocket launcher? [22:54:02] dschoon: I was kind of expecting that [22:54:11] well, we dunno, but if you want to get some hands on hadoop experience, this could be fun for you [22:54:12] so its up to you [22:54:31] yeah let me play around now that I have access [22:55:37] just remember to fix my perms when you get a chance so my sister doesn't sign in and pretend to be me [22:58:14] thx ottomata! [22:58:33] ok, yeah, loading this now, gimme a few [22:58:35] gotta scripty [23:03:31] btw anyone volunteering to review http://www.thinkmind.org/index.php?view=article&articleid=iciw_2012_5_50_20194 for the next WRN? 
Should be of interest to analytics [23:03:40] deadline is tonight [23:04:09] i can do it [23:04:15] send me the paper [23:04:18] sweet [23:06:10] in your inbox [23:06:44] thx [23:07:15] DarTar: if you forward to me as well, i'll try to get to it. no promises. [23:07:34] (unless that "full article" link on the page actually works) [23:07:50] sure and we have plenty of stuff to cover, I'll go through the remaining papers myself tonight: [23:08:03] http://etherpad.wmflabs.org/pad/p/WRN201211 [23:08:24] the traffic paper is open access [23:09:30] dschoon: done [23:09:42] ty [23:09:57] hokay! [23:09:58] DarTar [23:10:03] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/projectcounts-raw/projectcounts-2010?file_filter=any [23:10:05] better? [23:10:23] jesus loves you [23:10:29] this I know [23:10:30] so! [23:10:31] ok [23:10:39] I gotta run reaaaaal soon, but lets see what we can do [23:10:40] pig time! [23:10:49] unless, drdee, wanna give DarTar a pig tutorial? [23:10:51] oink [23:10:53] or maybe erosen_ [23:10:55] ? [23:11:04] yo [23:11:18] can do [23:11:19] or just a pointer to get me started, whatever is cheaper [23:11:20] DarTar, here's you shell [23:11:21] http://hue.analytics.wikimedia.org/shell/create?keyName=pig [23:11:27] i'll let you guys figure that out [23:11:37] but, erosen_, basically, DarTar wants to do some counting! [23:11:39] sweet, danke! [23:11:40] on the projectcount-raw files [23:12:01] here are some working examples : [23:12:04] darter want to chat in person real quick? [23:12:10] https://github.com/wmf-analytics/kraken/tree/master/src/main/pig [23:12:11] yeah sure [23:12:11] dartar [23:12:21] i can come over your way [23:12:36] sure [23:12:44] thx ottomata (waves) [23:12:54] btw, I think you guys could script this all up yoselves, but I am so excited that you are using the hadoop stuff! [23:12:58] you are our early adopters! [23:13:02] we need your feedback, weeeee! [23:13:25] yuppers, gotta run [23:13:27] ok see yas! [23:13:27] * DarTar needs a t-shirt [23:13:34] branded, that is
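Finally, the "shell scripting" route mentioned for DarTar's request, daily enwiki totals from the hourly projectcounts files, is small enough to sketch. It assumes the projectcounts-YYYYMMDD-HHMMSS file naming used under dumps.wikimedia.org/other/pagecounts-raw and reads the third field of the "en - 10443851 279587739618" lines as the hourly page view count (DarTar's "page views and total traffic size" reading of the two numbers).

    #!/bin/bash
    # Sum the hourly "en" project counts into one total per day.
    # Run in a directory of downloaded projectcounts-YYYYMMDD-HHMMSS files
    # (file naming assumed, see above).
    set -e

    for f in projectcounts-*; do
        day=${f#projectcounts-}       # e.g. 20111101-010000
        day=${day%%-*}                # keep YYYYMMDD, drop -HHMMSS
        views=$(awk '$1 == "en" {print $3; exit}' "$f")   # line format: en - <pageviews> <bytes>
        echo "$day ${views:-0}"
    done | awk '{ sum[$1] += $2 } END { for (d in sum) print d, sum[d] }' | sort

The same per-hour counting is what a Pig job over the concatenated projectcounts file in HDFS would do; the shell version just skips Hadoop, which is the trade-off ottomata and drdee concede above for a dataset this small.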