[13:20:49] Change merged: Erik Zachte; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/71785
[13:21:07] Change merged: Erik Zachte; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/71786
[13:39:29] ottomata whoop there he is, whoop there he is!
[13:46:08] whoop whoop
[14:01:35] haha, IRC clients suck monkeys
[14:02:21] welcome milimetric to a brave new world
[14:02:39] now, I read A Brave New World
[14:02:45] so that statement thoroughly scares me
[14:02:55] what happened, is WMF offering mandatory medication?
[14:08:57] sort of, you have to hangout with me 40 hours a week
[14:12:33] hangout or 'hangout'?
[14:49:06] heya drdee
[14:49:15] can you help me figure out why I only have _total and _misc here:
[14:49:16] http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&h=analytics1006.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[14:49:21] for packet_loss_average
[14:49:23] sure
[14:49:29] i'm checking rolematcher and logtailer stuff
[14:49:43] i run rolematcher.py on the packet-loss .log file on an06
[14:49:45] and see a buncha more roles
[14:49:52] than i see in ganglia
[14:50:06] and rolematcher.py is working?
[14:51:11] looks like it
[14:51:30] mmmmmm
[14:51:33] i should modify PacketLossLogtailer.py so it has a __main__ method
[14:51:35] for debugging
[14:51:56] yeah that sounds reasonable
[15:01:41] hmmmmmm
[15:01:41] 2013-07-03 15:01:35,312 WARNING Parsing exception caught at 271: regmatch or contents failed with 'all_roles'
[15:02:13] that might be ok
[15:02:23] i do see lots of metrics submitted
[15:02:41] Oh, now I have tons more metrics, now that I ran it manually
[15:02:42] hm
[15:03:39] drdee, why are these different roles?
[15:03:50] packet_loss_average:eqiad_mobile_cp:1046-1060
[15:03:50] packet_loss_average:eqiad_mobile_cp:1041-1044
[15:03:56] http://noc.wikimedia.org/pybal/eqiad/mobile
[15:04:27] 1041-1044 are the old mobile varnishes (not running atm)
[15:04:37] 1046-1060 are the new mobile varnishes
[15:11:06] hey drdee, I'm still confused about this labs thing
[15:11:23] you said you had figured it out, maybe you can explain
[15:11:41] i did read the Tools help page now
[15:12:42] ok, but still, shouldn't they be the same role?
[15:16:43] milimetric: follow the instructions on https://mingle.corp.wikimedia.org/projects/analytics/cards/753
[15:17:18] ottomata: maybe, but the former group of varnishes is no longer active
[15:17:41] and i think i wrote rolematcher in such a way that it groups consecutively numbered servers into the same role
[15:18:04] these new numbers are all over the place so it will probably create three new roles
[15:18:16] (because that used to be a stable rule of thumb)
[15:19:29] hmmmmmmmmm
[15:19:42] aren't the different pybal pages enough?
[15:23:56] drdee
[15:24:00] sure
[15:24:01] i think that this
[15:24:02] Parsing exception caught at 271: regmatch or contents failed with 'all_roles'
[15:24:07] is causing the main metrics not to be reported
[15:24:13] k
[15:24:20] weird because it used to work
[15:24:26] but you are probably right
[15:24:27] the unnamed metric without the role suffix
[15:24:48] hm, ok drdee, but are we allowed to install our own redis instance? They say to use the common one and prefix the keys, but we don't have that option since Celery is using it on our behalf
[15:26:38] milimetric: i don't see why that would be an issue but maybe talk with Ryan_Lane
[15:26:47] k
[15:31:34] i'm lost again drdee
[15:31:40] tools-login is some central place right?
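(Editor's note: a minimal sketch of the __main__ debugging hook ottomata proposes at 14:51, in Python 2 to match the era. It assumes PacketLossLogtailer follows the usual ganglia-logtailer convention, where parse_line(line) accumulates state and get_state(duration) returns metric objects with .name and .value; the default log path is hypothetical.)

    import sys

    if __name__ == '__main__':
        parser = PacketLossLogtailer()
        # the default path is an assumption; pass the real packet-loss log as argv[1]
        logfile = sys.argv[1] if len(sys.argv) > 1 else 'packet-loss.log'
        for lineno, line in enumerate(open(logfile), 1):
            try:
                parser.parse_line(line)
            except Exception as e:
                # surfaces failures like the "Parsing exception caught at 271"
                # warning quoted above, with the offending line number
                print >> sys.stderr, 'parse failure at line %d: %s' % (lineno, e)
        # dump the metrics that would be submitted to ganglia
        for metric in parser.get_state(60):
            print metric.name, metric.value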
[15:31:44] it's not my own personal vm...
[15:32:21] connect to bastion.wmflabs.org
[15:32:24] using ssh
[15:32:28] right, no i logged in
[15:32:36] then ssh to tools-login
[15:32:38] i just can't understand how this is set up
[15:32:48] ask Coren / Ryan_Lane
[15:32:53] well, yeah, so once on tools-login, should I just install wikimetrics?
[15:33:02] k
[15:33:03] just enter 'become wikimetrics'
[15:33:11] i'm saying i did all that
[15:33:15] that's basically sudo -U local.wikimetrics
[15:33:18] but i don't comprehend how to use it
[15:33:24] like, is this a VM?
[15:33:26] and then connect to mysql
[15:33:27] is it mine?
[15:33:31] it's shared
[15:33:44] but shared and anybody can do whatever they want on here?
[15:33:48] yes it's a VM, all instances in labs are VMs
[15:34:01] but this particular instance is just one single vm?
[15:34:06] dunno
[15:34:11] ok
[15:34:11] ask Coren / Ryan_Lane
[15:34:14] i'll ask
[15:34:30] k
[15:35:42] yeah, drdee, i'm still figuring this out, but in the effort of reducing the number of metrics in ganglia, do you mind if I remove the numbers from the ganglia role group?
[15:35:46] that rolematcher returns?
[15:35:53] don't mind at all
[15:36:03] basically just
[15:36:07] if match:
[15:36:08] return True
[15:36:08] right?
[15:36:29] uhhhhh
[15:36:32] or should I just keep the __eq__ method the same and just always return self.role
[15:54:37] drdee
[15:54:37] https://gerrit.wikimedia.org/r/#/c/71817/
[15:54:39] s'ok?
[15:54:47] looking
[15:55:40] +1
[15:55:47] danke
[16:01:27] drdee: did you ever figure out the wiki on which Yuri was storing the carrier metadata?
[16:02:29] you mean https://wikimediafoundation.org/wiki/Mobile_partnerships ?
[16:02:45] yuri has some kind of metadata page
[16:02:47] no, it was like a special namespace on mediawiki or something
[16:02:51] which had json schemas
[16:02:55] mmmmm
[16:03:04] i asked him once and he said that such a thing did not exist
[16:03:31] weird
[16:03:36] he showed it to me once
[16:03:39] I'll just chat him directly
[16:03:52] ottomata, drdee: do you remember his irc handle
[16:04:02] yurik
[16:04:03] i think?
[16:04:06] yup
[16:04:06] ya
[16:04:07] danke
[16:04:08] ja
[16:06:23] drdee: ottomata: http://meta.wikimedia.org/wiki/Zero:410-01
[16:06:26] ok, walking to office
[16:06:28] see you laters
[16:07:29] cool erosen
[16:10:42] ottomata: can you draw me a diagram of all the components / programs of our data flow, from the moment that varnishncsa emits a log line to the moment it has been written to HDFS?
[16:11:32] current flow?
[16:11:50] do you want machine/service level or more abstract
[16:11:57] like 'udp2log kafka producer'
[16:12:51] every daemon, script, cronjob, service that is running and can potentially break
[16:13:12] more detailed is better, and for the current setup
[16:14:27] like, every instance?
[16:14:28] each data flow
[16:14:29] ok
[16:16:05] like varnishncsa -> udp2log -> shard script -> kafka producer etc etc (one for the kraken-related flow and the other one for the sampled data flows on gadolinium)
[16:18:55] oh those too
[16:18:56] but i mean
[16:18:59] you want
[16:19:42] varnishncsa -> gadolinium socat multicast relay -> analytics1006 udp2log kafka_producer -> kafka brokers an21 & an22 -> kafka hadoop consumer (cron job on an02) -> hdfs
[16:19:43] ?
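(Editor's note: a minimal sketch of the rolematcher change discussed at 15:36, assuming a role class whose equality currently includes the host-number range; the class name, attribute, and regex here are hypothetical since the actual rolematcher.py source isn't quoted.)

    import re

    class Role(object):
        """Hypothetical stand-in for the rolematcher.py role class."""
        def __init__(self, ganglia_group):
            # e.g. 'eqiad_mobile_cp:1046-1060'
            self.ganglia_group = ganglia_group

        @property
        def role(self):
            # strip the trailing host-number range so old and new varnish
            # groups collapse into one ganglia role
            return re.sub(r':[\d-]+$', '', self.ganglia_group)

        def __eq__(self, other):
            # compare on the number-free role name only, per the
            # "always return self.role" idea above
            return self.role == other.role

    # both mobile varnish groups from 15:03 now map to the same role:
    assert Role('eqiad_mobile_cp:1041-1044') == Role('eqiad_mobile_cp:1046-1060')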
[16:19:56] that's the webrequest-wikipedia-mobile flow
[16:20:00] there'd be 3 of those
[16:20:46] yup tx
[16:20:57] hmm, oook
[16:21:16] I'd use omnigraffle and generate a pdf, unless you have a better graphy thing
[16:21:23] suggestions?
[16:21:26] this is fine
[16:21:47] k
[16:22:40] the sharding script is gone?
[16:23:51] yeah no more awk sharding
[16:23:55] it was just awk
[16:23:59] not really a script
[16:24:00] but yeah
[16:24:41] can it fail?
[16:25:02] what are you using now?
[16:28:38] what exactly is the udp2log kafka producer?
[16:30:51] ottomata: can you help me fill out the table in https://mingle.corp.wikimedia.org/projects/analytics/cards/789
[16:31:24] yes it can fail
[16:31:27] it is the same as before
[16:31:30] except only one host instead of 4
[16:31:57] it is using this wrapper
[16:31:57] https://github.com/wikimedia/kraken/blob/master/bin/kafka-produce
[16:32:20] so that should be added to the table on card 789 as well?
[16:32:31] or is that the udp2log kafka producer?
[16:32:46] sorry for asking all these q's
[16:33:06] that is
[16:33:31] otto@analytics1006:~$ cat /etc/udp2log/webrequest
[16:33:31] …
[16:33:31] # pipe all requests from mobile frontend cache servers into kafka
[16:33:31] pipe 1 /bin/grep -P '^cp(104[1-4]|10[6-7]|1059|1060|301[1-4])' | /opt/kraken/bin/kafka-produce webrequest-wikipedia-mobile 9951 > /dev/null
[16:34:25] k
[16:37:07] New patchset: Stefan.petrea; "Debianization to conform with Ops requirements" [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/68711
[16:38:02] hey average
[16:38:23] hi drdee
[16:38:40] I addressed all of the comments in Faidon's review and pushed a new patchset for dclass
[16:39:13] didn't have enough time to address 738, also had some problems with access to stat1002 which are now solved
[16:39:59] we still have some stuff left for the package, such as SONAME and SOVERSION, and basically the problem of co-installability .. that's the only thing left IMHO
[16:40:26] about 738: didn't EZ merge your patchset?
[16:40:59] if yes, what's left to be done?
[16:40:59] he did, but yesterday I got access to stat1002 after standup, and by that time I was already looking at 738
[16:41:36] ottomata, can you please help fill out the table in https://mingle.corp.wikimedia.org/projects/analytics/cards/789 ?
[16:41:56] ha yes but i'm busy! but yes of course
[16:42:00] you want a diagram too, right?
[16:43:29] diagram i already got :)
[16:44:00] 738 was affected by the migration, so I have to go on stat1002 and see why the job isn't running (possible reasons: I had Perl modules on stat1 which were copied with rsync to stat1002; different directory structure; missing files; user rights/permission problems; or maybe the cron job isn't even present, so I have to insert it into my user's cron on stat1002)
[16:44:49] the cronjob should not run under your own account but under the stats account, and should probably be puppetized as well
[16:46:10] Perl modules should also be puppetized as those are now de-facto dependencies for wikistats; can you make a list of what you need?
[16:46:37] yes
[16:47:00] different directory structure / missing files: how is that possible?
[16:48:51] yes, you're right, those should not be a problem
[16:49:14] but the perl modules and installing the cronjob are things to look into
[16:56:23] hi, drdee, just wanted to follow up, is there any way I could get access to a raw list (not totals) of user agent strings for the past month/week/day (e.g. a simple text file with a single user agent per line, number of lines = number of requests)? it can be sampled too
[16:56:41] ottomata: your puppet stuff is soooooo good lately
[16:56:58] ha, aww, thanks, whatcha referring to?
[16:57:39] can a cronjob be puppetized? is there a previous example of this?
[16:57:39] was peeking at https://gerrit.wikimedia.org/r/#/c/71569/
[16:57:47] I just looked in the puppet repo and couldn't find an example of a puppetized cron job
[16:58:30] average: http://docs.puppetlabs.com/references/latest/type.html#cron
[16:58:49] thx
[16:58:53] also, for an example
[16:59:03] check misc/statistics.pp line 702
[17:13:16] ori-l, do you know where I could find raw user agent strings perhaps (my question at 9:56)? I vaguely remember you did some user agent parsing
[17:14:36] jgonera: for the mobile site?
[17:14:43] yes
[17:15:29] on stat1, /a/squid/archive/mobile
[17:15:51] oh, right, thanks
[17:15:58] e.g. /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130703.gz
[17:17:30] that's updated daily?
[17:18:11] I remember you said that we get enough traffic for the daily data to be enough to see averages, right?
[17:20:55] it's updated daily and yes, there's no difference between the sampled logs and the full logs in terms of user agent breakdowns
[17:21:33] um, hang on
[17:23:19] jgonera, argh sorry
[17:23:24] ori-l is right
[17:23:36] yeah, but I mean, should I just parse one day to know if X% of our users use iPhone 3, or should I look at a week of data?
[17:23:38] jgonera: the logs from 06/27-today seem off
[17:23:48] but yes, that's adequate
[17:23:54] i would go for one week
[17:24:11] I guess I should take a week into account to be sure that weekend/holiday usage differences don't influence the stats?
[17:24:17] I see
[17:24:18] jgonera: yeah
[17:24:30] and stick to data from before june 27th
[17:24:56] i think i still have python code for that somewhere on stat1 if you want it
[17:25:20] hm, I actually thought about adding this to our mobile dashboard at some point (not very soon), I guess this will get fixed soon?
[17:25:43] oh, wait, now I remember, I used that too, I must have a copy of your code in my home dir
[17:25:51] I guess I'm getting old ;)
[17:25:55] if you want to do that then it's probably faster to have a job on kraken and run it regularly
[17:26:06] if you want, we can spec a card together
[17:26:41] drdee, if we could do it, that would be great. I'm just not sure yet how we want to parse the user agent strings and how we'd like to group different devices/browsers together
[17:26:47] I'll play with this a bit
[17:26:50] thanks guys
[17:27:02] we can always port your business logic
[17:27:08] so please keep me posted
[17:27:35] we are using openddr for device class detection, so please save yourself a lot of headache by not writing your own regexes
[17:27:45] we can also pull the regexes from wikistats
[17:28:34] we already have very similar jobs running for tomasz to detect wikipedia official and non-official apps
[17:28:57] and we have an external contractor ready to get his hands dirty with writing more of these jobs
[17:29:58] drdee, filling out your table
[17:30:05] ty sir
[17:30:12] are the Ganglia and Icinga columns for existing stuff, or for this new canary event?
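(Editor's note: a rough sketch of the kind of user-agent tally jgonera and ori-l discuss above, in Python 2, reading one of the gzipped sampled logs; the user-agent column index is an assumption about the udp2log TSV layout, so verify it against a few lines first. At a 1:100 sampling rate, multiply counts by 100 to approximate real request volume.)

    import gzip
    import sys
    from collections import Counter

    UA_FIELD = 13  # assumed zero-based position of the user-agent column

    counts = Counter()
    # e.g. /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130703.gz
    with gzip.open(sys.argv[1]) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) > UA_FIELD:
                counts[fields[UA_FIELD]] += 1

    # most common user agents first
    for ua, n in counts.most_common(25):
        print '%8d  %s' % (n, ua)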
[17:33:44] existing
[17:35:06] drdee, I won't write the regexes myself, I remember I used a Python or Ruby lib once that did it pretty well, I'll see how well OpenDDR handles some corner cases like modded Androids and such
[17:41:04] k
[18:02:35] where is webstatscollector?
[18:05:12] where is it running? or where is the source code?
[18:08:15] tnegrin: ^^
[18:08:31] running
[18:13:41] on gadolinium, that's the name of the box
[18:29:32] average around?
[18:33:41] drdee: yes
[19:32:52] average still around?
[19:32:56] come back to meeee
[19:39:48] milimetric, erosen: i took a very rough first stab at writing an epic for the data explorer, see https://mingle.corp.wikimedia.org/projects/analytics/cards/788
[19:39:59] please have a look and shoot it down
[19:41:03] :)
[19:41:12] will do drdee
[19:41:26] ty
[20:04:04] milimetric, erosen: privacy meeting wikimetrics
[20:04:10] https://plus.google.com/hangouts/_/8e1ba17e5c53eaeaa39755c8d2bd26fc2fae9fde
[20:04:22] oops!
[20:15:38] New patchset: Ottomata; "Debianization to conform with Ops requirements" [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/68711
[20:17:34] New review: Ottomata; "Getting there. Let's work on fixing as many of these Lintian errors and warnings as we can. Just g..." [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/68711
[20:17:55] New review: Ottomata; "These 4 are important:" [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/68711
[21:30:52] drdee: ottomata: want to check out the latest dashboards?
[21:31:07] e.g. http://gp.wmflabs.org/dashboards/orange-kenya
[21:31:41] k looking
[21:31:43] fingers crossed!
[21:31:45] http://gp.wmflabs.org/dashboards/orange-tunisia
[21:31:56] ottomata: be warned that one is a little scary
[21:31:58] but I think it is okay
[21:32:10] no data for may 29ish?
[21:32:13] if you check the percent graph it shows that the jump in June isn't system-wide
[21:32:16] yeah
[21:32:46] why no may 29ish data
[21:32:48] wha?
[21:32:55] actually that is my fault
[21:33:00] i was removing the 1st
[21:33:36] oh phew, ok
[21:33:45] why remove the 1st?
[21:33:56] it was a hack not to show all of june
[21:34:02] hm ok
[21:34:04] but I'll modify that and redeploy
[21:34:16] but the parts you were worried about, they look alright, right?
[21:34:30] well, mostly i was worried about deduplicating the may 25ish data
[21:34:35] yeah
[21:34:35] and it looks about right, right?
[21:34:48] yeah
[21:34:51] it looks around the level of the 1st week in june
[21:34:58] this graph could be useful, too
[21:34:59] http://gp.wmflabs.org/graphs/free_mobile_traffic_by_version
[21:35:07] that has all of the carriers summed together
[21:35:15] ah yes
[21:35:25] that is good
[21:35:28] zero out all counts prior to launch (which makes things a little complicated)
[21:35:50] not sure what's up with june 12
[21:35:54] but I guess I have a question about May 28, 29, 30, 31
[21:36:05] is that what you were referring to?
[21:36:13] I only removed the data for June 1
[21:36:21] yeah, why is there a gap there?
[21:36:51] no idea
[21:36:54] i think it is in the tsv?
[21:36:57] (checking)
[21:37:10] i looked for may 29 00.00.00
[21:37:11] it's there
[21:37:57] but in the tsv it's not
[21:38:14] two lines: 5/27/13 zu wikipedia.org Z US total-access-(dtac)-thailand 1
[21:38:15] 6/1/13 aa wikipedia.org M CI orange-ivory-coast 1
[21:38:24] consecutive lines, that is
[21:38:39] wha, really?
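(Editor's note: a quick gap check for the missing May 28-31 rows discussed above, in Python 2: scan a TSV whose first column is a date in the m/d/yy form of the quoted lines and print any missing days. This is a debugging sketch, not part of the actual limn/kraken tooling.)

    import sys
    from datetime import datetime, timedelta

    dates = set()
    with open(sys.argv[1]) as f:
        for line in f:
            first = line.split('\t', 1)[0]
            try:
                dates.add(datetime.strptime(first, '%m/%d/%y').date())
            except ValueError:
                continue  # header or malformed row

    if dates:
        day, last = min(dates), max(dates)
        while day <= last:
            if day not in dates:
                print 'missing:', day.isoformat()
            day += timedelta(days=1)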
[21:38:44] it should be coalesced
[21:38:44] hm
[21:39:03] oh sorry
[21:39:07] i was looking at raw data
[21:39:14] ja, i knew that
[21:39:14] that just means that the jobs haven't been run for those dates, hm
[21:39:22] oooh
[21:39:24] i didn't know that
[21:39:25] sorry
[21:39:27] raw raw
[21:39:32] not just unaggregated
[21:39:38] hm
[21:40:50] ok hm, i guess i just need to run the jobs for those days
[21:40:51] hm
[21:40:54] i gotta run now
[21:40:56] but i'll check into that tomorrow
[21:40:58] shouldn't take long
[21:41:01] k
[21:41:02] np
[21:41:08] btw, cron is running on limn0
[21:41:11] every 15
[21:41:24] cool!
[22:39:18] hey erosen, when/if you work on wikimetrics, are you still stuck?
[22:39:24] YO
[22:39:26] yo*
[22:39:38] i can be back on wikimetrics now
[22:39:45] i'm not pressuring
[22:39:47] just asking
[22:39:59] yeah, we should sync up before you finish up
[22:40:14] i'm just writing a "continuity" page
[22:40:18] a la Guillaume
[22:42:05] milimetric: want to hangout?
[22:42:18] sure