[04:52:33] New patchset: Rfaulk; "mod/add. move handlers outside of apache context." [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72476
[04:55:56] Change merged: Rfaulk; [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72476
[05:27:21] New patchset: Rfaulk; "revert. last change, meant to be in separate branch." [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72478
[05:27:50] Change merged: Rfaulk; [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72478
[06:24:52] New patchset: Rfaulk; "add. broker module to handle communication between apache runtime and handlers." [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72479
[06:34:21] New patchset: Rfaulk; "add. Module to handle actions among broker and api components." [analytics/user-metrics] (master) - https://gerrit.wikimedia.org/r/72481
[06:42:47] New patchset: Rfaulk; "add. broker module to handle communication between apache runtime and handlers." [analytics/user-metrics] (repair_runtime) - https://gerrit.wikimedia.org/r/72482
[06:43:15] Change merged: Rfaulk; [analytics/user-metrics] (repair_runtime) - https://gerrit.wikimedia.org/r/72482
[14:27:42] jooooooooo
[14:28:04] average around
[14:28:05] ?
[14:32:46] morning ottomata
[14:32:52] morning milimetric
[14:33:10] moorning
[14:33:15] drdee, just looked at that packetloss thing
[14:33:25] the udp2log -> kafka producer pipe was broken again
[14:33:38] i don't know why, there are no useful logs
[14:33:44] this is all I got
[14:33:44] https://gist.github.com/ottomata/5949326
[14:33:48] we need to restart that automatically, is there a way to do that?
[14:34:06] the thing is, none of the processes die
[14:34:11] that's why udp2log isn't doing it
[14:34:23] like tail grepping the log and looking for 'broken pipe'?
[14:34:30] that would work, so hacky though
[14:34:35] of course
[14:34:39] also, i don't know why this stream died but an09's didn't
[14:34:43] (the geo+anon stream)
[14:34:53] last week when this happened, it was clearly a too-much-data problem
[14:35:00] and the symptoms were a little different too
[14:35:36] actually
[14:35:38] hmmmm
[14:35:39] yeah
[14:35:39] hm
[14:36:40] also bad is that when those 'broken pipe' messages happen
[14:36:43] it fills up /
[14:36:48] why is that?
[14:36:49] udp2log.log gets huge
[14:36:55] that gets output for every message
[14:36:56] logrotate?
[14:37:17] maybe, it's set to rotate daily now
[14:37:32] i guess it could help to rotate based on file size
[14:37:32] but
[14:37:34] still
[14:37:43] i hear ya
[14:37:51] maybe I should try Ori's Udp Kafka?
[14:38:49] yeah i think it's worth a shot
[14:38:55] maybe as a third pipeline?
[14:38:58] yeah
[14:39:01] i'll take an05
[14:39:03] k
[14:39:06] lemme make sure this one is back up..
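The automatic restart discussed above ([14:33:48]–[14:34:30]) could be a small watchdog that does exactly what ottomata describes: tail the udp2log log, look for 'broken pipe', and bounce the pipeline. The sketch below only illustrates that idea; the log path, the restart command, and the check interval are assumptions, not the production setup.

    #!/usr/bin/env python
    # Minimal watchdog sketch: tail the udp2log log file and restart the
    # kafka-producer pipeline when "broken pipe" errors start appearing.
    # The log path and restart command below are placeholders, not the
    # real production values.
    import subprocess
    import time

    LOG_FILE = '/var/log/udp2log/udp2log.log'        # assumed location
    RESTART_CMD = ['service', 'udp2log', 'restart']  # assumed restart action
    CHECK_INTERVAL = 60  # seconds between checks


    def tail_has_broken_pipe(path, lines=200):
        """Return True if 'broken pipe' shows up in the last few log lines."""
        out = subprocess.check_output(['tail', '-n', str(lines), path])
        return b'broken pipe' in out.lower()


    def main():
        while True:
            try:
                if tail_has_broken_pipe(LOG_FILE):
                    # The pipe to the kafka producer is wedged but no process
                    # has died, so udp2log itself will not restart anything.
                    subprocess.check_call(RESTART_CMD)
            except (OSError, subprocess.CalledProcessError):
                pass  # log file missing or restart failed; try again next round
            time.sleep(CHECK_INTERVAL)


    if __name__ == '__main__':
        main()

As noted in the log itself, this is hacky; it papers over the failure rather than explaining it, which is why the conversation keeps coming back to better monitoring instead.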
[14:40:07] at least we are running the zero stuff on the other data stream
[14:40:09] :)
[14:40:37] actually, i don't think any jobs are running on the an06 data stream right now
[14:43:13] hmmmmmmmmmM
[14:43:14] i am confused
[14:43:16] because
[14:43:33] kafka produce events didn't drop off
[14:43:34] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Analytics+cluster+eqiad&h=analytics1006.eqiad.wmnet&jr=&js=&v=3004679&m=udp2log_kafka_producer_webrequest-wikipedia-mobile.AsyncProducerEvents&ti=udp2log_kafka_producer_webrequest-wikipedia-mobile.AsyncProducerEvents
[14:43:38] OH
[14:43:39] yes they did
[14:43:44] sorry, i was looking at the wrong timeframe
[14:44:03] wait but
[14:44:03] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=analytics102%5B12%5D.eqiad.wmnet&mreg[]=kafka_network_SocketServerStats.ProduceRequestsPerSecond&gtype=stack&title=kafka_network_SocketServerStats.ProduceRequestsPerSecond&aggregate=1
[14:44:05] weird
[14:48:51] yo qchris
[14:48:58] Hi drdee
[14:49:04] lunch time? :D
[14:49:10] ori-l, you there?
[14:49:11] :-))
[14:49:19] drdee: Lunch's over already.
[14:49:24] tea time?
[14:49:51] * drdee is in a 'funny' mood
[14:49:57] drdee: :-) No, I'm just trying to get Hadoop to run locally.
[14:50:14] can i help you?
[14:51:05] I guess not. libprotobuf's jar is not found by maven.
[14:51:33] And I just installed a 32-bit jvm to make hadoop happy (which only comes precompiled for 32-bit jvms)
[14:52:00] So just fixing one thing after the other, but I am not stuck yet.
[14:52:09] which distribution are you using?
[14:52:31] Gentoo
[14:52:54] sorry i mean hadoop distribution
[14:53:05] Just vanilla hadoop-common
[14:53:13] To connect to a cdh4
[14:53:27] (The cdh4 is up and running)
[14:53:53] But it seems it's connecting now :-)
[14:55:38] New patchset: Erik Zachte; "comscore data for May 2013" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/72524
[14:57:59] Change merged: Erik Zachte; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/72524
[15:01:18] qchris
[15:01:23] i have a vagrant vm that works decently well
[15:01:40] i haven't done a lot of work to make sure it works for others
[15:01:45] ottomata: Mhmmm. Sounds good.
[15:01:50] but it should be pretty well puppetized using the same stuff we are using in prod
[15:01:55] ottomata: Where could I get that?
[15:02:03] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/vagrant/kraken
[15:02:12] i think i'll need to walk you through a few things
[15:02:13] Thanks
[15:02:18] I haven't used a fresh clone of that in a while
[15:02:26] i'll clone it myself too :)
[15:03:29] oh totally lemme push some stuff
[15:04:57] ottomata: But thinking of it ... since I can now connect to my cdh4 as expected ... I guess I am set already :-)
[15:05:54] oh ok
[15:06:44] But I'll keep that repo nonetheless in case I run into problems. Thanks.
[15:07:01] ok i just pushed
[15:07:03] i *think*
[15:07:10] if you run vagrant up and/or vagrant provision
[15:07:15] you should end up with a working hadoop
[15:37:48] milimetric: bat cave?
[15:37:58] yep, brt
[15:38:01] k
[16:12:02] New patchset: Milimetric; "got the rest of the bytes added tests passing" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72533
[16:12:16] Change merged: Milimetric; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72533
[16:31:39] drdee: hi
[16:31:47] yo
[16:31:47] ottomata: can you add me to the stats group on stat1002 pls ?
[16:32:02] got permission problems with the stats user
[16:32:16] ottomata: also, would it be possible for me to log in as the stats user as well with my ssh key ?
[16:32:39] h,
[16:33:33] ottomata: hi :)
[16:33:45] hmm
[16:34:16] it would be better i think if we modified the files so that they had group perms for wikidev
[16:34:50] what happens to new files added to the git repo ? what group will own them ?
[16:34:51] where does /a/wikistats_git/pageviews_reports/bin/stat1-cron-script.sh output to?
[16:35:10] new files, wikidev, but they probably won't be group writeable
[16:35:15] unless we make them so
[16:37:12] stat1-cron-script writes to /tmp/pageviews-full-cron/ ; /tmp/pageviews-full-cron/map/ ; /tmp/cperlver ; /tmp/cperlerr
[16:37:43] had to start the cron job manually inside a GNU screen on my user
[16:37:50] /tmp shouldn't have a problem then
[16:38:27] average, what is your current problem?
[16:39:46] I cannot interact with the cron job. Cannot stop it if there are problems. The files in the output of the above-mentioned script are not writable from my user
[16:39:49] spetrea@stat1002:~$ ls -lhS /tmp/cperlerr
[16:39:51] -rw-rw-r-- 1 stats stats 93K Jul 8 07:21 /tmp/cperlerr
[16:39:54] spetrea@stat1002:~$ echo "test11" >> /tmp/cperlerr
[16:39:56] -bash: /tmp/cperlerr: Permission denied
[16:41:00] why do you need to write to those files, aren't they just log files?
[16:41:31] they are. anyway, today the job was supposed to run (because we put weekday=>1 in puppet on the cronjob), and it did
[16:41:51] but there was a directory which was supposed to be there and it was not there, it's /tmp/pageviews-full-cron/
[16:41:54] so I created it
[16:42:17] but the job which ran before that failed because that directory was not there
[16:43:17] shouldn't the code create the directory?
[16:43:32] it should, that's my mistake..
[16:43:36] ah ok
[16:44:12] drdee, i dunno what to do here
[16:44:30] ottomata: when a new gerrit patchset is merged in gerrit, is the code mirrored immediately to /a/wikistats_git ?
[16:44:37] not immediately
[16:44:41] whenever puppet runs
[16:44:42] i think
[16:44:43] actually
[16:44:43] no
[16:44:45] that's not in puppet
[16:44:49] someone has to pull
[16:44:54] oh
[16:45:22] drdee: we are trying to productionize wikistats, but wikistats isn't really productionizable, it needs more development work and testing, which normally would be done locally or in labs, but we can't do that work there because wikistats needs the private webrequest files to run, right?
[16:45:37] yup
[16:45:49] it's a nasty problem
[16:46:08] we could make a staging setup in labs with fake webrequest files
[16:48:33] This particular part is productionizable. I have to add an mkdir to the cron script so the output directory is present
[16:49:05] ya, but it sucks, because it's super hard to test, right?
[16:49:23] it only runs once a week as the stats user, and you don't have privileges for that user, and it takes hours and hours to run
[16:49:54] it's hard to test because I don't have access to stat1002 as the stats user, which holds the cron job
[16:52:18] Hours & hours, yes, that's true. That's basically a relic from the dev process. Because we attempted a lot of variations on the filtering criteria, so we had to re-run it over again with all the data.
[16:53:45] But normally we should just need to process the latest month. That alone is just 50m
[16:57:09] hm ok
[16:57:33] ok I think I can put you in the stats group
[16:57:39] not sure how much that will help, but at least a little bit?
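The fix average mentions at [16:48:33] (make the cron script create its own output directory) boils down to a mkdir -p before the job writes anything. Below is a minimal sketch in Python using the /tmp paths listed at [16:37:12]; the real change went into the shell cron script itself, so this only illustrates the idea.

    # Sketch of the "create the output directory first" fix discussed above.
    # The actual change belongs in stat1-cron-script.sh; paths are the ones
    # mentioned at [16:37:12].
    import os

    OUTPUT_DIRS = [
        '/tmp/pageviews-full-cron',
        '/tmp/pageviews-full-cron/map',
    ]

    for d in OUTPUT_DIRS:
        if not os.path.isdir(d):
            os.makedirs(d)  # like mkdir -p: creates missing parents as well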
[16:59:55] yes
[17:00:11] ottomata, average: scrum
[17:16:53] average: you and ezachte are not in stats group on stat1002
[17:17:25] ottomata: can we be ?
[17:17:34] sorry
[17:17:35] ahah
[17:17:38] that was supposed to say
[17:17:39] now*
[17:17:42] now in stats group
[17:17:43] so yes!
[17:17:45] you can and are
[17:17:46] oh cool :)
[17:17:49] thank you
[17:18:17] ottomata: is stats a user that's not supposed to be logged in to ?
[17:18:38] right, it's not
[17:18:51] and, that would be complicated, i think, i'd have to ask ops people how they would want me to do that
[17:18:56] and they would say that they don't want me to do that
[17:19:02] i could maybe give you sudo to stats permissions
[17:19:06] but it would be controversial
[17:19:25] oh ok, I don't want to be controversial
[17:19:58] yeah, it might be doable, but i'd have to justify it real hard, and then there would be a discussion about not giving any sudo to people, etc. etc.
[17:25:33] getting food, back in a bit, erosen, I'm doubling the files now...
[17:25:37] will re-run jobs when I get back
[17:38:35] milimetric: ping
[18:00:45] New patchset: Stefan.petrea; "T1" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/72556
[18:01:35] New patchset: Stefan.petrea; "Added creation of output directory in cron script" [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/72556
[18:02:08] Change merged: Stefan.petrea; [analytics/wikistats] (master) - https://gerrit.wikimedia.org/r/72556
[18:17:12] oh hey erosen, missed your ping
[18:17:23] sry about that, what's up
[18:19:58] ottomata: hey, you're back :)
[18:20:20] ottomata: should we really have distribution name "wikimedia" ?
[18:20:27] ottomata: for the package
[18:20:50] not sure
[18:20:51] um
[18:21:35] i think we do
[18:21:41] -wikimedia
[18:21:41] so
[18:21:43] for this
[18:21:46] precise-wikimedia
[18:22:07] ok
[18:24:18] that is a lintian warning that we can probably ignore
[18:24:30] i think we can add ignore flags to debian/rules
[18:24:36] somehow
[18:24:37] if you want
[18:38:03] ottomata: got a sec to make a tiny security fix?
[18:38:15] it slipped through the cracks and csteip identified it as pretty urgent
[18:41:04] ja, in ops meetings so kinda listening to that
[18:41:06] wasssup?
[18:41:24] sent you an email
[18:41:32] super tiny fix, just gotta make something https
[18:41:36] (a link)
[19:22:47] erosen, doubling is taking a looong time, it is copying ~230GB
[19:22:56] almost at 200GB now, so getting close
[19:38:45] ottomata: great
[19:46:35] milimetric: hey, got a sec to chat celery stuff?
[19:47:31] yep
[19:47:33] hangout?
[19:47:58] erosen: ^^
[19:48:01] ya
[19:54:21] ok, erosen, rerunning jobs
[19:54:29] k
[19:54:42] once it's done i'll put data in place and try to coalesce manually
[19:54:45] ottomata: ping me when they are finished and I'll rerun the dashboard script
[19:54:48] awesome
[19:57:47] ottomata: do we have a mingle card for the udp2log puppet stuff?
[20:00:05] milimetric, erosen: just spoke with YuviPanda and Coren about running wikimetrics in labs, and they came up with an alternative to using mod_wsgi that is already operational
[20:00:18] https://github.com/addshore/dumpscan/blob/master/dumpscan.py for an example
[20:00:27] (can help with the CGI wrapper if needed)
[20:00:55] Coren will continue to work on uwsgi and we can use that once it's ready but this way we are not dependent
[20:01:01] thanks YuviPanda!
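For the mod_wsgi alternative just mentioned: a WSGI app such as wikimetrics can be served as a plain CGI script using the standard-library CGIHandler, which needs no long-running server process on labs. The following is a minimal sketch, assuming a hypothetical wikimetrics.web.app import path for the application object (the linked dumpscan.py is a plain CGI script rather than a WSGI bridge, so this is an adaptation of the idea, not the exact approach).

    #!/usr/bin/env python
    # Minimal sketch of running a WSGI application (e.g. a Flask app) as a
    # plain CGI script, as an interim alternative to mod_wsgi/uwsgi on labs.
    # "wikimetrics.web.app" is an assumed import path, not necessarily the
    # project's real module layout.
    from wsgiref.handlers import CGIHandler

    from wikimetrics.web import app  # hypothetical: the Flask/WSGI app object

    if __name__ == '__main__':
        # Each request spawns a new interpreter, so this is slow, but it
        # keeps the labs setup simple until uwsgi is ready.
        CGIHandler().run(app)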
[20:01:09] yw
[20:01:19] ok cool, thanks YuviPanda
[20:01:22] thanks drdee
[20:01:36] erosen and I are stuck on something so we're gonna keep at this today
[20:01:40] but we'll talk tomorrow?
[20:01:50] k
[20:01:50] sure
[20:02:01] okay, that was for drdee :P
[20:02:01] ok
[20:02:22] you are always welcome to join our hangouts YuviPanda
[20:02:37] * drdee loves party crashing pandas
[20:02:50] hehe, will gladly do if there's something I can help with on toollabs
[20:03:08] I've been spending weekends / 'research time' on toollabs stuff, so :)
[20:04:15] drdee, i think we don't, i will create one
[20:04:24] ty
[20:04:36] 1 sec
[20:04:52] is it this one: https://mingle.corp.wikimedia.org/projects/analytics/cards/736 maybe?
[20:06:13] naw
[20:06:16] that's different for sure
[20:06:27] here you go
[20:06:28] https://mingle.corp.wikimedia.org/projects/analytics/cards/796
[20:06:59] k
[20:26:00] New patchset: Erosen; "adds non-functional job serialization framework" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72624
[20:26:17] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72624
[20:59:07] erosen, very close, i'm coalescing country now, once that's done the .tsvs will be ready
[20:59:23] would you prefer me to rsync to stats.wm.org/kraken-public manually? or just grab them out of hdfs?
[20:59:41] my fingers are so crossed that this will work
[20:59:55] i really really really hope it did what I just intended to do
[21:05:08] AGHGHGHGHHG I AM GOING TO GO CRAZY
[21:05:16] udp2log kafka producer on an09 has been down
[21:05:31] I HAD NO NOTIFICATIONS gAgghghhhh,
[21:05:34] this thing is so hacky
[21:05:36] aggghhhh
[21:11:41] oh no ottomata
[21:11:48] that sux
[21:12:26] there are 0 logs as to why either, just logs in kafka.log about the async producer shutting down
[21:12:28] sigh sigh sigh sigh
[21:12:37] can we add some more logging to udp2log?
[21:13:34] dunno, this is totally weird, i've been meaning to add notifications to per-producer produce events, but it's more complicated because the metrics are based on topic names
[21:13:51] i recently got that stuff into ganglia, but I have to figure out how to parameterize the icinga alert
[21:13:56] which is why I haven't done that bit yet
[21:14:11] plus, every other time this has happened an alert has been triggered at the broker-level produce metrics
[21:14:18] k, you need help? i am pretty good at googling :)
[21:14:19] i'm not sure why yet...
[21:14:21] so many mysteries
[21:16:22] here's my opinion
[21:16:31] debugging this setup causes more wasted time than hardening the data pipeline would
[21:16:38] and a lot more stress and unreliable service
[21:16:54] I suggest that we just start hardening and thinking from a "test-driven" perspective
[21:16:59] indeed, possibly ori-l's UdpKafka will be better than udp2log
[21:17:03] but, aside from that
[21:17:11] we don't have anything until we get the kafka producers into varnish
[21:17:18] so whatever updates we make, I suggest we make sure the pipeline is 100% testable and monitorable
[21:17:35] yeah
[21:18:01] milimetric: that's why we have https://mingle.corp.wikimedia.org/projects/analytics/cards/789 :)
[21:18:50] erosen
[21:18:52] you there?
[21:18:55] he's not around no
[21:18:59] ratso
[21:19:08] the jobs are done and coalesced
[21:19:16] i'm syncing to stats.wm.org/kraken-public now
[21:22:40] ottomata: how goes it?
[21:22:50] milimetric: want to resume hangout?
[21:23:00] yep
[21:23:29] heyaaa
[21:23:31] it goes!
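The icinga alert ottomata wants to parameterize ([21:13:34]–[21:13:56]) could start from a small Nagios-style plugin that pulls the per-topic producer metric out of gmond's XML dump and goes critical when the metric disappears or reads zero. This is a rough sketch only: the host, port, metric name, and thresholds are assumptions, and a real check would probably alert on the rate of change rather than the raw counter value.

    #!/usr/bin/env python
    # Sketch of a Nagios/icinga-style check for a per-topic udp2log kafka
    # producer metric exposed via ganglia. Hostname and metric name are
    # placeholders; the real metric names embed the topic, e.g.
    # udp2log_kafka_producer_<topic>.AsyncProducerEvents.
    import socket
    import sys
    import xml.etree.ElementTree as ET

    GMOND_HOST = 'analytics1009.eqiad.wmnet'  # assumed
    GMOND_PORT = 8649                         # gmond's default XML port
    METRIC = 'udp2log_kafka_producer_webrequest-wikipedia-mobile.AsyncProducerEvents'


    def fetch_gmond_xml(host, port):
        """Read the full XML dump that gmond serves on its TCP port."""
        sock = socket.create_connection((host, port), timeout=10)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b''.join(chunks)


    def main():
        try:
            tree = ET.fromstring(fetch_gmond_xml(GMOND_HOST, GMOND_PORT))
        except (socket.error, ET.ParseError) as e:
            print('UNKNOWN: could not read gmond XML: %s' % e)
            sys.exit(3)
        for metric in tree.iter('METRIC'):
            if metric.get('NAME') == METRIC:
                value = float(metric.get('VAL', 0))
                if value <= 0:
                    print('CRITICAL: %s = %s' % (METRIC, value))
                    sys.exit(2)
                print('OK: %s = %s' % (METRIC, value))
                sys.exit(0)
        print('CRITICAL: metric %s not reported at all' % METRIC)
        sys.exit(2)


    if __name__ == '__main__':
        main()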
[21:23:33] .tsvs are ready
[21:23:39] aaaawesome
[21:23:48] get them from hdfs
[21:23:57] zat ok?
[21:24:03] i'm trying to sync them to stats.wm.org/kraken-public now
[21:24:21] sure
[21:24:25] either way
[21:25:18] fingers SO crossed right now
[21:25:28] how long will it take you to generate new graphs (I'm probably going to peace out soon)
[21:25:39] 10 min max
[21:25:53] k
[21:46:36] how's erosen?
[21:46:40] how'sit?
[21:47:07] niiiice
[21:47:14] ooh
[21:47:25] you mean you already uploaded the tsvs :/
[21:47:34] i haven't run the dashboard script yet
[21:47:36] but will do that now
[21:48:57] milimetric, erosen: you might wanna add your email address to this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=49058
[21:51:53] ok, yeah, i need to go rreaaaal soon erosen
[21:51:56] i'm so hoping they are good
[21:52:04] ottomata: can we hangout for 5m ?
[21:52:16] ottomata: I wanna show you some stuff
[21:52:26] ottomata: one sec
[21:52:29] deploying now
[21:52:30] i think he's busy average
[21:52:34] just pulling from limn0
[21:52:35] and he's gotta leave soon
[21:52:37] maybe tomorrow
[21:53:11] ok
[21:53:32] average you can show me
[21:54:09] ottomata: can't log into limn0 to force the update (though it will happen in 6 min anyway)
[21:54:51] hm uhh ok
[21:55:11] can you log into limn0?
[21:55:37] just pulled
[21:55:37] yes
[21:55:41] this one?
[21:55:41] k danke
[21:55:41] /var/lib/limn/gp/gp-zero-data-repository
[21:55:43] yup
[21:55:49] k ja pulled
[21:55:54] hrm
[21:56:06] did you rsync the datafiles?
[21:56:36] i think not, my connection died halfway in between
[21:56:38] can you get them from hdfs?
[21:56:44] yeah, just didn't know
[21:57:19] running the sync again, but it will take a bit
[21:58:36] k
[22:00:33] average, ja sorry i gotta run
[22:01:11] ottomata: what is the timeline for the sync?
[22:01:23] i can't seem to scp my way through the cluster right now
[22:01:42] nvm
[22:01:44] got it
[22:03:23] ottomata: finally deploying new data from hdfs
[22:03:49] k tell me when to pull on limn0
[22:05:55] ottomata: go for it
[22:06:10] k done
[22:06:14] now what?
[22:06:28] ok
[22:06:30] http://gp.wmflabs.org/graphs/free_mobile_traffic_by_version
[22:06:34] looks a little high, eh?
[22:06:38] looking
[22:06:46] it does...
[22:07:09] and then we have the weirdness on July 4 and 6
[22:07:09] hmm
[22:07:16] yeah, 4th should be fine I think...
[22:07:21] after that i dunno
[22:07:21] sigh
[22:07:27] ok, sigh
[22:07:46] what do we do now? i need to look into the july stuff tomorrow
[22:07:48] don't have time now
[22:07:51] yeah
[22:07:58] I think we just hold off on sharing the dashboards
[22:08:03] but i guess we can't use this for june, right?
[22:08:03] i'll let amit know
[22:08:08] i'll email as well
[22:08:09] not really
[22:08:11] k
[22:11:25] alright, i'm outty
[22:11:35] sorry erosen, i'll see if I can get more insight tomorrow
[22:11:43] no worries
[22:11:46] don't stress out too much
[22:11:54] not that you usually do that
[22:11:57] haha
[22:12:04] k latas
[22:37:03] k erosen, I'm back in hangout
[22:37:09] k
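For reference, the manual deploy dance at the end of this log (pull the coalesced .tsvs out of HDFS, rsync them toward stats.wm.org/kraken-public, then pull on limn0) could be scripted roughly as below. Every path and hostname in this sketch is a guess rather than the real layout, and it only automates the two copy steps being done by hand above.

    #!/usr/bin/env python
    # Rough sketch of the manual deploy steps at the end of this log:
    # copy coalesced .tsv output out of HDFS, then rsync it toward the
    # public datasets location. All paths and hostnames are assumptions.
    import subprocess

    HDFS_DIR = '/wmf/public/webrequest/zero'          # hypothetical HDFS path
    LOCAL_DIR = '/tmp/kraken-public-staging'          # local staging area
    PUBLIC_DEST = 'stat1001.wikimedia.org:/srv/kraken-public/'  # hypothetical rsync target


    def run(cmd):
        print('running: %s' % ' '.join(cmd))
        subprocess.check_call(cmd)


    def main():
        # Copy the job output out of HDFS into a local staging directory.
        run(['hdfs', 'dfs', '-get', HDFS_DIR, LOCAL_DIR])
        # Push the staged files to the public datasets host; limn0 then picks
        # them up on its next scheduled pull (or a manual git pull).
        run(['rsync', '-rv', LOCAL_DIR + '/', PUBLIC_DEST])


    if __name__ == '__main__':
        main()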