[00:03:19] drdee: you about?
[00:04:27] drdee: (for when you are) could you send me info about the sampled data to use for the tablet counts in #61?
[00:05:00] hm. i will send an email.
[00:44:26] ori-l thanks a bunch for the rsync puppet, I used it to build: https://gerrit.wikimedia.org/r/54811
[00:45:11] if that gets accepted, it'll rsync yuvi's stuff so you could probably abandon the one you set up
[00:46:12] I decided to make the rsync go from stat1 to stat1001 because that way it won't have to sweat whether or not the SQL finished running
[00:49:35] milimetric: were you referring to deploying any particular type of artifact?
[00:49:59] no, it doesn't have to be specific
[00:50:08] we didn't understand the policies altogether
[00:53:55] oops, forgot to ping kraigparkinson ^^
[00:55:55] ok
[00:55:58] thanks. :)
[00:57:21] milimetric, what's left to do for #68?
[00:59:43] i am around btw
[01:00:41] 68 is now waiting on two more things
[01:00:48] 1. ops approval of my puppet change
[01:01:15] 2. me to point the graphs to the result of the scripts I just puppetized
[01:01:57] kraigparkinson: so I'm hoping ottomata can polish up any mistakes I made with puppet and we can have that merged quickly tomorrow morning
[01:02:13] and as far as #2, it'll take no more than 20 minutes as soon as 1 is done
[01:02:26] so there's some risk that this won't get done, but it's sort of out of our hands
[01:02:57] OK.
[01:03:03] Thanks for the update. :)
[01:03:26] can you update me on that tomorrow morning around 9 pacific?
[01:03:33] :)
[01:04:26] drdee: updates to https://mingle.corp.wikimedia.org/projects/analytics/cards/92 are in
[01:04:41] drdee: i'm going to be leaving in about 30min. let me know if you need anything from me
[01:04:42] yes, thanks! reading right now
[01:10:37] drdee: please mail if any questions come up. im going to wind down
[01:10:48] tfinc: will do
[01:30:31] kraigparkinson: will do
[01:30:55] thanks, milimetric
[09:28:26] [travis-ci] master/b40f613 (#100 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5651302
[10:15:09] [travis-ci] master/f314567 (#101 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5652129
[11:12:09] [travis-ci] master/1991a3e (#102 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5653141
[12:16:22] morning
[12:18:47] * YuviPanda waves at milimetric
[12:18:58] hey YuviPanda
[12:19:06] you are in EDT?
[12:19:16] we're waiting on the gerrit patchset to get reviewed
[12:19:22] yes, I'm on EDT (EST)
[12:19:43] ah, nice
[12:19:48] * YuviPanda looks at analytics/limn-mobile-data
[12:20:05] milimetric: is /a/ world readable?
[12:20:13] i don't think so
[12:20:18] hmm
[12:20:18] okay
[12:20:22] i think that's actually where all of the private data is stored
[12:20:38] based on my poor reading of statistics.pp
[12:20:51] looks like at least some raw logs are being stored there
[12:20:57] oh! you reminded me!
[12:21:01] i never changed site.pp
[12:25:06] milimetric: ah, okay
[12:25:14] milimetric: also your commits - some are as Dan, some as milimetric
[12:25:29] i think i committed from stat1
[12:25:40] and forgot to git --config global user.name
[12:25:48] or whatever
[12:26:08] it's ok, I won't do that again
[12:26:12] as vim is unusable over ssh
[12:42:24] milimetric: you can scp ~/.gitconfig dan@stat1:/home/dan/
[12:49:23] thx average_drifter, I just set the username and email
[13:01:09] morning
[13:08:16] morning drdee
[13:08:22] yoyo
[13:21:48] got some big problems with git on stat1
[13:21:53] dunno what's going on
[13:22:00] it just stalls
[13:22:04] I hit git status, it stalls
[13:24:01] I would report it on https://rt.wikimedia.org/ , but I don't have an account
[13:24:39] yeah, stat1 is being crazy slow
[13:24:46] vim is acting up on it too, right average_drifter?
[13:25:10] milimetric: haven't tried vim, I edit locally and I push to stat1 and I run there
[13:25:20] but I can't use git on stat1
[13:25:21] :|
[13:25:27] milimetric: would you report this to ops please ?
[13:25:51] yeah
[13:25:51] I guess I'll just resort to scp -r if I can't push..
[13:25:56] wait with reporting
[13:26:03] let's first try to diagnose the problem ourselves
[13:27:39] average_drifter: which repo can mark try?
[13:28:23] milimetric: /home/spetrea/wikistats/pageviews_reports
[13:29:06] now it works, I don't know why, it's weird
[13:29:19] I would still like if mark could have a look at it
[13:29:52] git checkout now takes loads of time
[13:31:22] milimetric: vim acting up, confirmed
[13:32:58] isn't it just that the load is high and that another process is doing a lot of IO stuff?
[13:34:42] there's like 2 processes
[13:34:44] one of ezachte
[13:34:49] k average_drifter, it's just 100% disk util
[13:34:54] mark confirms
[13:34:58] and one of halfak
[13:35:06] but vim is definitely most horribly affected by that - must be the history tracking
[13:35:21] i'd suggest editing in nano and expecting IO bound stuff to take forever
[13:35:26] hm.......................
[13:35:28] drdee!
[13:35:36] maybe that's why Tillman got the email at 7:00pm
[13:35:42] :D
[13:35:48] though that's literally 19 hours after the job would've started
[13:36:22] or wait no... it's 5 hours earlier than it should've started
[13:36:26] ok, I have to go now, I'll be back for standup hopefuly
[13:37:35] milimetric: could you talk to mark please and ask if he can solve this. I will have to roll out some new reports today
[13:37:55] there's no solving it, it's just disks spiked to literally 100%
[13:38:03] he suggested removing that symlink you have also
[13:38:54] brb chickens & shower
[13:38:57] spetrea@stat1:~/wikistats$ df -h
[13:38:58] Filesystem Size Used Avail Use% Mounted on
[13:38:58] /dev/mapper/stat1-root 14G 5.6G 7.7G 43% /
[13:38:58] udev 16G 4.0K 16G 1% /dev
[13:38:58] tmpfs 6.3G 356K 6.3G 1% /run
[13:39:00] none 5.0M 0 5.0M 0% /run/lock
[13:39:03] none 16G 172K 16G 1% /run/shm
[13:39:05] /dev/mapper/stat1-a 6.4T 2.7T 3.4T 45% /a
[13:39:08] 208.80.152.185:/data 48T 42T 5.8T 88% /mnt/data
[13:39:10] /dev/mapper/stat1-tmp 50G 5.9G 42G 13% /tmp
[13:39:13] /dev/mapper/stat1-home 1008G 413G 545G 44% /home
[13:39:17] last row, doesn't seem to me like 100%
[13:39:23] more like 44%
[13:40:18] ok I'm relly going now
[13:40:19] bbl
[13:49:31] grrrrrr
[13:49:53] ottomata changed it too - he made the job run at 02:00 UTC
[13:49:56] drdee ^^
[14:52:26] mornin
[14:53:29] morning
[14:53:57] so, new zero job in the format that erosen expects is running
[14:54:05] it replaces the old job.
[14:54:09] nice
[14:54:14] nice, you're about
[14:54:20] output lives here:
[14:54:26] let me know when it is ready and I'll give it a run through
[14:55:10] http://localhost:8888/filebrowser/wmf/public/webrequest/zero_carrier_country/2013/
[14:55:12] and so on
[14:55:19] when you drill to a job result, you get:
[14:55:36] er, add a /view in there
[14:55:37] heh
[14:55:38] http://localhost:8888/filebrowser/view/wmf/public/webrequest/zero_carrier_country/2013/03/16/20.00.00/
[14:55:46] the first link should have been http://localhost:8888/filebrowser/view/wmf/public/webrequest/zero_carrier_country/2013
[14:55:56] anyway -- it has a directory for carrier and for country
[14:56:04] nice
[14:56:07] format is the same for both, but with - for carrier in the country files
[14:56:14] * erosen ktunneling...
[14:56:33] the rollups are currently disabled (as ottomata suggested) until the job catches up
[14:56:42] then i'll have it generate a combined CSV for each run
[14:56:51] or even a bigger rollup
[14:57:28] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/2013/03/16/00.00.00/
[14:57:31] there's a public link
[14:57:40] dschoon: cool, the rollups aren't a necessity for me either, as my code does this
[14:57:44] kk
[14:57:49] then we'll ignore it :)
[14:58:00] COOOL
[14:58:13] it'd be fuckin sweet if we had an updated dashboard before scrum
[14:58:21] dschoon, talk to me about the webrequest loss job again real quick, i haven't looked at it in a while and you had the other day
[14:58:21] so we could perform the almighty Card Moving
[14:58:30] oh, right
[14:58:35] it needs:
[14:58:35] i think i'd like that working again, i need to look into this packet loss issue more, and it would be nice to know if it was happening in kraken too
[14:58:55] 1. fix coordinator.properties to not have Ye Accursed Typo
[14:59:00] oh yes yes
[14:59:07] (analytics1010.wikimedia.org)
[14:59:12] dschoon: as for the card moving, shall we wait until i generate the dashboard and we decide that it looks okay?
[14:59:14] 2. kill old coord
[14:59:18] re submit?
[14:59:22] 3. resubmit to oozie
[14:59:23] ja
[14:59:24] ok cool
[14:59:26] on it
[14:59:29] it's really too bad you can't tell it to reread
[15:00:02] atm, i'm really pissy to find out that the ternary op in pig can only return *values
[15:00:10] so this is illegal:
[15:00:20] device_info = FOREACH device_info GENERATE day_hour, country, (is_wireless ? 'handheld' : (is_tablet ? 'tablet' : 'desktop')) as device_class:chararray;
[15:00:30] only the inner ternary is correct :P
[15:00:48] aye
[15:01:04] dschoon, you like this?
[15:01:05] ${YEAR}/${MONTH}/${DAY}/${HOUR}.${MINUTE}.00
[15:01:09] yeah.
[15:01:10] vs $YEAR-MONTH/...
[15:01:10] ?
[15:01:13] regular.
[15:01:16] ok
[15:01:18] will do the same
[15:01:19] exceptions suck :)
[15:01:19] i think i do too
[15:01:39] my device job is emitting its intermediate results (when it works) to /wmf/data/mobile/device_class
[15:01:52] /wmf/data?
[15:02:02] i figured it was time to start being a little more regimented about making data visible and re-usable
[15:02:09] yeah.
[15:02:25] for datasets you generate that can be re-used, but aren't necessarily public.
[15:02:30] materialized views, annotations, etc
[15:02:32] ahh, hmm
[15:02:56] I do wish I had imported the raw logs in time basd directories too :(
[15:03:09] something to fix when we recombobulate everything one day
[15:03:19] yeah :/
[15:03:38] i think for the second iteration of the device job, i'm going to do by-minute rollups.
[15:03:49] it reduces the data by like 1/10000th
[15:04:02] hm, but you don't want the data too small, right?
[15:04:02] but honestly nobody ever needs <1m resolution
[15:04:06] yeah
[15:04:18] also it makes the aggregate data larger
[15:04:59] the overhead of starting the jobs might be more than actually running the jobs
[15:08:13] [travis-ci] master/8edbaec (#103 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5659065
[15:08:41] i think it's long-past time for FairScheduler
[15:08:47] also, why the hell does that build keep failing?
[15:08:51] it builds locally.
[15:12:39] ottomata, did you get a chance to look at https://gerrit.wikimedia.org/r/54811?
[15:12:52] we need it for today, along with another small change
[15:13:05] (can't submit the other small change until I get this patchset done
[15:24:12] [travis-ci] master/2bcda38 (#104 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5659590
[15:26:19] AHHH i started, so many things!
[15:26:31] hah, yes, i have lots of comments, I will review extensively
[15:26:33] here's my list:
[15:27:00] 1. get webrequest loss back up (hopefully on 30 more mins)
[15:27:01] 2. review your thing
[15:27:01] 3. python .deb for hashar (only 30 mins or less)
[15:27:47] \O/ :-]
[15:28:24] cool, thanks ottomata, your RT weeks are packed :)
[15:29:02] this isn't even RT!
[15:29:06] i ahven't even looked at RT this week yet!
[15:29:09] this is alll people just asking me!
[15:29:10] heheh
[15:32:13] dschoon, should we kill the webrequest_mobile_device* coordinators?
[15:32:23] the workflows are all dying
[15:32:31] i'll take a look in a sec
[15:33:14] ok ok, hashar, i'll do yours first, since paravoid already approved it, that should only take a few mins
[15:33:30] niiiice
[15:34:13] [travis-ci] master/36b6f84 (#105 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5659984
[15:47:04] dschoon: that build keeps failing because travis-ci can't find the cdh packages
[15:47:12] huh
[15:47:24] i'll take a look later today
[15:47:38] see https://github.com/travis-ci/travis-ci/issues/948
[15:47:55] travis-ci should setup a proxy repo
[15:48:09] because their local settings is quite restricted
[15:50:29] hmmm
[15:50:33] oops wrong chat
[15:51:32] hmmmmm indeed
[15:54:43] gm all
[15:55:12] good morning kraig
[15:55:17] kraigparkinson: update
[15:55:28] ottomata is working on reviewing the changes I submitted last night
[15:56:29] morning
[15:56:44] milimetric, cool. What's the likelihood of that being ready for showcase?
[15:57:36] well, i thought it was pretty ready and that the review would be quick, but ottomata said he had loads of comments, so now I'm skeptical. I would estimate 35% likelihood
[15:58:29] k, would be good to debrief at some point on ottomata's feedback and how we can learn from it.
[15:58:32] dschoon, way to getting #244 ready for showcase
[15:58:41] way to go, that is. :)
[15:59:03] word
[15:59:09] almost done with a fix for 61
[15:59:18] does that mean #60 is on its way for this morning? :)
[15:59:27] er, right, thanks. :)
[16:00:07] average_drifter, how about #60?
[16:23:21] omw to the office
[16:25:13] drdee, have a sec to chat? I want to see if we can get a clear policy on the Customer property in Mingle.
[16:28:08] kraigparkinson: in a meeting with erikz
[16:28:48] k
[16:31:31] kraigparkinson: problems with stat1 blocked me to roll out reports for #60
[16:32:37] milimetric: how can i reach mark?
[16:33:08] in the operations channel, you were talking to him there before
[16:33:22] average_doc: ^
[16:33:24] whats his nickname?
[16:33:28] mark
[16:33:29] :)
[16:33:36] ok
[16:42:04] average_doc, hehheh, its really unclear what your problem is over in operations
[16:42:43] right now it is 'git and vim weren't working, erik's job was using only 1 cpu'
[16:43:26] well, ill leave it at that and hope that when ill get home it will be usable
[16:43:44] so i can wrap up #60
[16:43:52] ottomata, the problems are very annoying and just started happening
[16:44:08] but i agree that it's not very clear what's going on
[16:44:17] or whether it's anything that ops could help with
[16:44:42] but if you'd like to see it for yourself, ssh to stat1, open a file in vim, and try to save it
[16:45:00] however, before that, have you had a chance to look at that patchset?
[16:45:22] milimetric: is this a vim on stat1 prob?
[16:45:36] it's been reaallllly slow for me too
[16:45:42] yes erosen, but i'm pretty sure that's just a symptom
[16:45:47] mark said disk usage spikes to 100%
[16:46:02] this doesn't sound right to me, it sounds like there's some errant job
[16:46:11] hrm
[16:46:48] i was thinking it was an issue with my screen session
[16:46:59] well, ping me if you find anything out
[16:47:21] k
[16:47:43] btw erosen: workaround is to use nano
[16:48:05] so that makes me think it's vim's .swp file history tracking stuff that's causing an issue
[16:48:32] check halfak's job on stat1
[16:49:05] milimetric: https://gerrit.wikimedia.org/r/#/c/54811/
[16:49:06] reviewed
[16:49:54] vim seems totally fine for me on stat1
[16:50:14] woohoo thanks otto
[16:50:21] yeah, that uses 100% of a cpu
[16:50:27] ezachte is gzipping something right now too
[16:50:54] yikes, halfak is running amysql query that is using 50% of memory
[16:50:58] 14G
[16:52:41] ezachte's stuff should keep running
[16:52:50] i belief halfak's stuff is snuggle
[16:52:56] yeah
[16:53:03] and if it's misbehaving than we should just stop it for now
[16:53:07] and warn halfak
[16:53:25] kraigparkinson: quickly pre-scrum chat in 3 minutse?
[16:53:30] drdee: and if it's misbehaving than we should just stop it for now
[16:53:31] [12:53pm] drdee: and warn halfak
[16:53:33] ottomata ^^
[16:53:34] sure
[16:53:36] hmm, how do I get a hold of halfak
[16:53:38] i just tried chatting at him
[16:53:45] i will pm his email addresss
[16:53:50] k, CC me
[16:54:11] am in the scrum hangout./
[16:54:30] i cant get to scrum today
[16:54:44] im 50m away
[16:54:55] minutes
[16:54:58] drdee ^^
[16:56:13] and ottomata, that .my.cnf.research file is safe in /a/?
[16:56:19] it's not world readable or anything, right?
[16:56:42] you need to make the file perms right
[16:56:48] i put that in a comment too
[16:59:47] thanks, got it
[16:59:55] i don't know how to do the rsync thing...
[17:09:34] erosen!
[17:09:43] dschoon!?
[17:09:53] sup?
[17:10:07] qq -- You think those dashboards could make themselves before 11?
[17:10:15] eeeh
[17:10:17] perhaps
[17:10:21] we are hoping to demo them at the showcase :(
[17:10:21] I'd give 70%
[17:10:24] :/
[17:10:26] i'll start now
[17:10:28] hehe
[17:10:33] anything i can do to make that happen?
[17:11:02] dschoon: the only part that I am a little unclear about , is how to best grab the files
[17:11:10] if you want to point it at a different limn (to leave the current ones there) that's cool
[17:11:17] I'll just rsync to stat1 for now I guess
[17:11:17] they're on stat1001
[17:11:25] http or local works
[17:11:27] here:
[17:11:43] yaeh, but because they aren't going directly to limn, i need them somewhere i can run my python code (stat1)
[17:12:06] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/2013/03/
[17:12:07] ^^ erosen
[17:12:10] yep.
[17:12:37] is it possible to do a recursive copy with wget and an apache file browser?
[17:13:08] /a/srv/stats.wikimedia.org/htdocs/kraken-public/
[17:13:11] on stat1001
[17:13:12] also works
[17:13:20] world readable
[17:13:34] that's what I'll do, thanks
[17:13:47] everything under zero is only ~15M
[17:13:50] yes but I wouldn't put files in there manually
[17:13:54] right
[17:13:54] that dir is rsynced out of kraken
[17:13:56] out of hdfs
[17:14:00] he just needs to get to them
[17:14:03] oh ok
[17:14:04] to postprocess into a new location
[17:14:09] yeah, just copying from
[17:14:17] ok ok, sorry didn't read convo :)
[17:14:45] that cool, erosen?
[17:14:54] def lmk if there's anything else
[17:16:12] dschoon: actually, the path for recursive copy isn't working / I don't understand how it is supposed to work
[17:17:49] hm?
[17:17:52] you upstairs?
[17:18:02] naw, still at pydata
[17:18:07] grah
[17:18:16] okay, let's switch to gchat for pastecore
[17:18:28] dschoon: actually i got it
[17:18:33] k
[17:18:43] wait, nvm
[17:18:44] hehe
[17:18:53] dschoon gchat it is
[17:24:55] ottomata: ready for review: https://gerrit.wikimedia.org/r/54811
[17:26:18] okey dokey lets try it
[17:26:19] merging
[17:28:50] ah
[17:28:51] err: Failed to apply catalog: Parameter path failed: File paths must be fully qualified, not '$rsync_from/mobile/datafiles' at /var/lib/git/operations/puppet/manifests/misc/statistics.pp:760
[17:28:53] single quote prob i think
[17:29:31] yeah, milimetric, if you've got variables in your strings you need to use double quotes
[17:29:35] sorry I didn't catch that
[17:29:52] oh i didn't know
[17:29:58] thought it was like JS :(
[17:30:09] i'll fix and re-submit along with the other patch?
[17:30:18] the blog one? naw bette rto be separate
[17:30:23] ok, cool
[17:30:25] try using a topic branch for it!
[17:30:29] submitting the fix now
[17:30:32] k
[17:30:40] uh... would rather not, my feet are a bit on coals right now :)
[17:33:26] ok, pushed ottomata, and made it run every hour per YuviPanda's request
[17:33:43] wheee!
[17:33:46] now... we wait an hour?
[17:35:44] drdee: is the analyst scrum officially dead?
[17:36:04] DarTar: NOOOOO
[17:36:16] i'm at pydata, so I won't make it
[17:36:20] sorry for hte lack of warning
[17:36:28] ok, I'm hanging out :)
[17:36:47] erosen: np
[17:40:07] okey dokey milimetric, !
[17:40:07] # Puppet Name: rsync_mobile_apps_stats
[17:40:07] 0 * * * * python /a/limn-mobile-data/generate.py /a/limn-mobile-data/mobile/ && /usr/bin/rsync -rt /a/limn-public-data/* stat1001.wikimedia.org::www/limn-public-data/
[17:40:09] looks good
[17:40:24] so, one more thing, not for now
[17:40:30] if/when we add more scripts like this
[17:40:32] that gneerate limn public data
[17:40:45] dartar, I was thinking we could combine analyst scrum into weekly analytics showcase discussion, how about that?
[17:40:51] we might want to abstract this a bit more then
[17:40:54] but for now this is good
[17:40:55] yay!
[17:41:06] cool, thanks ottomata, I agree
[17:41:11] dartar, I think it would be good to surface that stuff to a broader if that's cool.
[17:42:09] dartar, whaddya think? :D
[17:42:44] kraigparkinson: I think it makes sense if we get enough participation
[17:43:22] dartar, if you're all at the showcase, then you will. :)
[17:43:49] and I expect you to be there. ;)
[17:44:03] Unless you don't believe in freedom.
[17:45:03] I'll be there but I am not sure how many other analysts will (evan/rf away, jonathan+aaron not attending, erik z?)
[17:46:34] ahhhhh 15 mins til showtime, i gotta run to a cafe, be bask asap
[17:46:54] erik z, yes. evan/rf maybe not today, but would how for most other weeks, jonathan invited and hope he comes. I'll add aaron. :)
[17:47:55] aaron invited. and we had a great chat over dinner last night. :)
[17:48:01] we're BFFs now.
[17:48:41] ottomata: https://gerrit.wikimedia.org/r/54878
[17:48:46] that's the next change
[17:48:59] kraigparkinson: 68 is ready for showcase
[17:49:03] http://mobile-reportcard-dev.wmflabs.org/
[17:49:15] rock out, milimetric! :)
[17:49:53] kraigparkinson: 154 is now ready for showcase too, thanks ottomata!
[17:50:18] sweeeet.
[17:50:29] brb with yummy chicken
[17:50:52] btw - I feel an extreme conflict of interest by both eating chicken and taking care of baby chicks
[17:51:06] just don't let the baby chicks see you eating chicken.
[17:51:27] also - while we're on the subject, I have no idea why "chicks" is considered a derogatory term for women. Baby chicks are like by far the cutest little animals and also the smartest babies I've ever dealt with
[17:51:31] "it's not what you think! it's not what you think!"
[17:51:42] tastes like human.
[17:51:43] :)
[17:52:06] milimetric: sorry, been in the security meeting. so does the dashboard auto-update now?
[17:55:22] YuviPanda: yes it does, every hour
[17:55:24] on the hour
[17:55:32] okay, so 35 minutes more
[17:55:41] milimetric: did you deploy to mobile-reportcard with the updated urls?
[17:55:45] no should be 5 more minutes
[17:55:50] only mobile-reportcard-dev
[17:56:08] but I pushed the change to the gerrit repo in the stages.py of limn-deploy
[17:56:16] so you can do fab mobile deploy.only_data if you'd like
[17:56:34] i just leave that up to you so it can be as stable (or unstable) as you like
[17:59:34] milimetric: ah, okay
[18:00:05] YuviPanda, milimetric: nice job
[18:00:08] kraigparkinson, dschoon, drdee, ottomata, I added a hangout as one didn't exist: https://plus.google.com/hangouts/_/30310b318ca8d356e8f15b27ce3143a3a3e2887b
[18:00:12] it's all milimetric's magic
[18:00:30] * DarTar moving to Chambers brb
[18:00:31] uh... I put the credit solely on ottomata's shoulders
[18:00:42] without his puppet skills we'd have nothing
[18:02:08] kraigparkinson, drdee, can you get into the hangout for the meeting? https://plus.google.com/hangouts/_/30310b318ca8d356e8f15b27ce3143a3a3e2887b ?
[18:03:13] * milimetric feels ignored
[18:03:14] :)
[18:03:47] milimetric: everyone jumped into meetings simultaneously
[18:03:59] yes but nobody's chatting anything here...
[18:05:21] because they're all in meetings!
[18:05:27] i am in one :(
[18:05:48] milimetric we are in https://plus.google.com/hangouts/_/5b70172d0f7418695ff6d98f3cb53bbb7097e020
[18:15:22] milimetric: deployed!
[18:15:24] thank you!
[18:15:45] no prob, looks good
[18:15:50] and should've been updated at 14:00
[18:15:53] (EST)
[18:16:06] drdee: i dont understand why https://mingle.corp.wikimedia.org/projects/analytics/cards/92 data output has country
[18:16:30] i specifically took that out in all other parts of the story to keep it simple
[18:17:34] milimetric: is there a way to see when it was last updated?
[18:17:55] nothing fancy YuviPanda, you could look at the cron logs
[18:18:06] heh, okay!
[18:19:02] tfinc: sorry, should have been remove
[18:19:03] d
[18:20:37] analytics folks, i need some input on the cert request for stat1 (if anyone here is involved, i see andrew got cc'd)
[18:20:38] https://rt.wikimedia.org/Ticket/Display.html?id=4473
[18:20:46] this is for stat1/stat1001/etc.
[18:25:28] milimetric: can we change the headings on http://mobile-reportcard.wmflabs.org/ to be Commons Mobile Uploads instead of Mobile Wikipedia Contributions
[18:26:10] :) the data is wholly owned by YuviPanda and you guys, the repository is in Gerrit
[18:26:18] so you can change it any way you'd like tfinc
[18:26:38] YuviPanda: --^
[18:26:40] tfinc: sure can do.
[18:26:43] thanks
[18:26:50] tfinc: will do later tonight
[18:26:53] jwild pointed out that the heading was misleading
[18:26:55] but I would like to mention that dashboard and graph editing is a hot topic that jwild has been an advocate for
[18:27:22] we're hoping to get it in a sprint as soon as we achieve our Q1 goals
[18:27:31] :D
[18:27:53] milimetric: also would be nice is documentation for different nodetypes you can put in graphs :)
[18:27:59] RobH: i belief we should only request ssl certificate for stat1001; stat1 should soon no longer be running apache2
[18:28:16] sure thing YuviPanda, will do that
[18:29:12] well, RobH
[18:29:19] drdee
[18:29:23] when we were talking about ssl for stat1
[18:29:31] thats because halfak is hosting his snuggle dev from there
[18:29:34] and wanted https for it
[18:29:45] that is not apache
[18:30:21] ottomata, see https://mingle.corp.wikimedia.org/projects/analytics/cards/385; we should not be running any web service on stat1
[18:31:27] drdee: yes, but isnt the url you guys serve not stat1001.wikimedia.org
[18:31:47] but something like metrics.wikimedia.org ?
[18:32:02] ie: shouldnt you guys want the cert for metrics.w.o not stat1001.w.o?
[18:32:07] dschoon, see my comment in hangout chat
[18:33:47] RobH, sorry, we are all in a meeting right now
[18:34:54] thats cool, this can wait
[18:35:07] just update ticket when you guys want movement ;]
[18:36:59] drdee: mute yourself please
[18:43:17] ottomata: sumanah is reporting problems accessing metrics.wikimedia.org from outside our IP range, indicating a possible DNS issue
[18:43:44] can you double check?
[18:45:41] ottomata: culprit found: HTTPSEverywhere
[18:46:28] accessible via http, but we should resume the conversation on https - drdee ^^
[18:58:05] drdee: i made an update to the data table on https://mingle.corp.wikimedia.org/projects/analytics/cards/92
[18:58:22] ah, yeah, i just commented on snuggle's SSL RT ticket about that
[18:58:24] DarTar
[18:58:38] if there isn't already, created a ticket with a request for a metrics.wm.o cert and assign it to RobH
[18:58:41] CC me maybe
[18:58:47] will do
[18:58:51] i can do the HTTPS setup, but if you want a cert he's gonna do it
[19:30:53] drdee, is this the same as your 4.2.0 large job bug?
[19:30:53] 2013-03-20 19:26:19,371 FATAL [Low Memory Detector] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Low Memory Detector,9,system] threw an Error. Shutting down now...
[19:30:53] java.lang.InternalError: Error in invoking listener
[19:31:08] java.lang.RuntimeException: InternalCachedBag.spill() should not be called
[19:31:16] no, but that has been a recurring bug
[19:31:22] requires fine-tuning jvm
[19:31:25] grr
[19:31:38] so I didn't up the JVM mem the other day, just turned off jvm reuse
[19:31:47] should I turn up JVM mem?
[19:32:56] or do I just need more mappers?
[19:40:54] or more mappers might also do the trick
[19:44:31] hmk
[19:45:15] wait how do I get more mappers? is that possible? hadoop does that automatically, right?
[19:46:38] I'm trying to install redis on toro.wmflabs.org (just a testing server for MW).
[19:46:43] Dan said you guys had looked into it.
[19:47:04] I got it showing up in https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup, but when I install it doesn't work since the default directory is /a/redis.
[19:47:11] Apparently I need to subclass it with a real directory.
[19:47:26] Before I mess with that or ask Ryan_Lane again, do you have any tips?
[19:49:52] if you are just trying a quick and dirty approach
[19:50:00] you could symlink /a/redis to wherever it needs to be, right?
[19:52:15] dschoon, you there?
[19:52:22] yep
[19:52:46] loss job is failling due to low memory detector errors
[19:52:51] any idea what I can do about that?
[19:53:48] hm, just read this
[19:53:48] https://issues.apache.org/jira/browse/PIG-3101
[19:56:57] ottomata, yeah, I guess. I'm trying to do it in a somewhat non-hacky way.
[19:57:17] ok, cool. /a/redis a thing from puppet thing?
[20:02:06] ottomata, yeah, it's the default dir in the redis class in operations/puppet.
[20:02:43] interesting.
[20:02:52] i'll take a look in after lunch, ok ottomata?
[20:03:21] i figured out quite a few optimizations while working on the zero job
[20:04:43] ok
[20:04:45] cool
[20:04:46] thanks
[20:05:19] ah yeah
[20:05:22] ok superm401
[20:05:33] how are you using the redis module?
[20:05:49] are you just checking the box next to your the redis class in the labsconsole interface?
[20:06:11] ottomata, yeah, currently.
[20:06:14] ok, yeah
[20:06:21] so the class is parameterized, so you won't be able to do that
[20:06:25] It is not self-hosted currently.
[20:06:27] are you doing self hosted puppetmaster?
[20:06:28] ah
[20:06:30] yeah, i think you'll have to
[20:06:36] Ahg.
[20:06:47] It's kind of annoying since you have to pull manually and you can't go back.
[20:07:00] yeah
[20:07:06] well
[20:07:06] or
[20:07:07] I see there's a Variables section at https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup
[20:07:10] But that doesn't help?
[20:07:11] you could push a new class or role
[20:07:16] reading
[20:07:46] naw, i think that is just a global variable, not a class parameter
[20:08:21] i think...
[20:08:53] actually i'm not sure
[20:08:55] you shoudl ask Ryan_Lane about that
[20:09:11] if adding variables to the class groups works for parameterized classes
[20:09:21] but, i was going to say
[20:09:27] hmm, naw, nm
[20:09:31] what I was going to suggest isn't good
[20:09:32] yeah
[20:09:35] ask Ryan_Lane about that
[20:09:37] and if it doesn't work
[20:09:39] you'll have to do self hosted
[20:09:54] ottomata, btw, you were offline while I was praising thy name
[20:09:56] Okay, I'll ask him.
[20:10:01] I appreciate your help though.
[20:10:03] thank you so much for the help with puppet
[20:10:19] today was productive because of that
[20:10:53] ottomata is generally pretty awesome
[20:15:04] ottomata: you still using asana for personal task tracking?
[20:15:16] i was last week
[20:15:17] haven't this week
[20:15:26] daw, thanks guys!
[20:15:30] ottomata: can do you know anything about the ticket for henrique stat1 access: https://rt.wikimedia.org/Ticket/Display.html?id=4726
[20:15:36] i like getting things done!
[20:15:39] productivity ftw!
[20:17:41] ottomata: ^^
[20:17:57] just asking RobH in ops, one sec
[20:17:58] ottomata: alternatively, I should I just ping the RT ticket, or is that discouraged?
[20:18:02] thanks
[20:18:19] stealing and doing...
[20:18:29] stealing and doing?
[20:18:41] and no, no problem with pinging RT tickets after the 3 days have passed
[20:18:55] hehe, yeah, RT has a 'steal' button, for taking the ticket from someone else and assiging it to yourself
[20:19:10] oh its already assigned to me :p
[20:22:46] erosen, do you think he'd rather have 'henrique' or 'handrade' or somethign else as his username?
[20:23:05] 'handrade' is his preferred name
[20:23:12] he's sitting next to me
[20:23:15] btw
[20:23:20] cool
[20:24:03] hashar: hi, I have a song for you http://garage-coding.com/song.mp3
[20:26:02] funny song
[20:26:22] what is the license ? :-D
[20:26:52] hashar: it's like WTFPL
[20:33:20] average_drifter: some garage trip hop ;-D
[20:33:47] erosen:
[20:33:47] notice: /Stage[main]/Accounts::Handrade/Unixaccount[Henrique Andrade]/User[handrade]/ensure: created
[20:33:48] notice: /Stage[main]/Accounts::Handrade/Ssh_authorized_key[henrique@NBK-DTIC-ST05]/ensure: created
[20:33:56] yo
[20:33:59] yay
[20:34:02] will check now
[20:35:08] ottomata: small catch, apparently henrique doesnt' have his key on his new computer yet
[20:35:12] !log restarting hadoop, upping mapreduce.task.io.sort.mb from 100 to 200
[20:35:19] oh
[20:35:23] is it possible to swtich keys on labs and have that change propogrte?
[20:35:30] does he have another key?
[20:35:38] you'd have to give me the key
[20:35:38] or is it hard coded
[20:35:39] its hardcoded
[20:35:42] ic
[20:35:44] does he want both keys useable?
[20:36:05] he doesn't need both
[20:36:09] he just needs one
[20:36:18] can I have him send it to you right now?
[20:36:21] ja
[20:36:42] ottomata, did we leave container reuse on?
[20:36:54] no its off, 1 == off, right?
[20:37:00] ....
[20:37:12]
[20:37:12] oh, heh
[20:37:13] mapreduce.job.reuse.jvm.num.tasks
[20:37:13] 1
[20:37:13]
[20:37:14] yes 1 is off
[20:37:15] right
[20:37:17] num tasks.
[20:37:20] ah num tasks
[20:37:20] yeah
[20:37:21] heheh
[20:37:24] i was like, "1 sounds a lot like true to me"
[20:37:26] okay.
[20:37:28] haha, yeah
[20:37:32] ottomata: key sent
[20:37:38] hello ottomata
[20:37:41] hiya!
[20:37:42] just sent the key
[20:37:48] i think my job failed because something ate all the memory on the cluster
[20:37:50] thanks!
[20:37:51] :)
[20:37:57] and we don't have resource-aware scheduling turned on :/
[20:45:33] hm
[20:45:39] zat why mine failed too?
[20:45:44] i just restarted hadoop and am about to try mine again
[20:48:55] hmm, same error for my job, even with sort.io turned up
[20:49:07] 2013-03-20 20:47:06,166 INFO [Low Memory Detector] org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 139853824(136576K) used = 112921040(110274K) committed = 139853824(136576K) max = 139853824(136576K)
[20:55:15] yeah
[20:55:28] i think we the whole thing
[20:55:31] this isn't the 4.2.0 large job issue?
[20:55:43] do you seriously think these jobs are large?
[20:55:49] mine is like, 2G
[20:56:01] the output set is like 50M
[20:57:00] no
[20:57:03] its only an hour
[20:57:09] i'm not sure how large 'large' is for that bug
[20:57:51] yeah, mine died from low memory again.
[20:58:00] how much ram does each of these boxes have?
[20:59:29] 28G
[20:59:31] 48G
[20:59:31] *
[20:59:50] memory is pretty utilized
[20:59:52] 1G free
[21:00:10] who there are a lot of datanode procs
[21:01:20] lots of threads
[21:01:22] hmm, is that a real threadd?
[21:01:26] or a process?
[21:01:29] i guess its a thread
[21:01:29] hmm
[21:01:36] do they have distinct mem usage? hmm
[21:01:53] ottomata: any progress on the henrique key?
[21:01:57] oh
[21:01:58] done!
[21:02:05] sorry forgot to ttell you
[21:02:22] dschoon, 70 datanode threads
[21:02:23] on an20
[21:02:30] brb meeting!
[21:08:47] ottomata: superm401 and I were gonna try out his redis thing on puppet1 since that's already self-hosted
[21:08:50] any objections?
[21:09:06] that's fine
[21:09:07] then we can use that as our "test random puppet stuff" box
[21:09:08] k
[21:09:13] yeah, that's what I meant it for
[21:09:26] not for hosting things like limn or actually using redis though
[21:09:30] just for testing random puppet stuff
[21:09:30] Cool, thank you.
[21:09:34] cool, and it's ok if we git checkout without worrying about what's checked out right now?
[21:09:39] Understood. When this actually works it will be used on toro.
[21:09:41] right of course
[21:09:43] ottomata, i just got ssh access. thanks a lot!
[21:14:14] yeah
[21:14:15] doesn't matter
[21:14:19] yup!
[21:16:00] hmmmm
[21:16:00]
[21:16:01] yarn.nodemanager.container-manager.thread-count
[21:16:01] 20
[21:16:01] yarn-default.xml
[21:16:01]
[21:16:51] drdee I took card 92, and I'll need some help understanding some pieces of it
[21:16:55] namely "The count is the number of pageviews using org.wikimedia.analytics.kraken.pig.PageViewEvalFunc, but it should run in the webstatscollector mode plus deduplicating mobile api requests."
[21:17:14] but we can talk tomorrow, I'm gonna finish some puppet testing and call it a day
[21:37:31] that doesn't seem awful
[21:37:38] the threadcount, ottomata
[21:37:42] a huge amount of time is IO
[21:38:01] yeah, but there are 92 nodemanager threads
[21:38:02] not 20
[21:38:03] so iunno
[21:38:08] unless i'm counting wrong
[21:38:17] but yeah, they are threads, and i was trying to figure out what was using up all the mem
[21:38:22] the threads should be shared mem, no?
[21:42:15] hmm, i guess a lot of the mem usage is in buffers/cache
[21:42:19] so thats fine
[21:42:20] hm
[21:42:21] i dunno
[22:01:59] kraigparkinson: running 3 minutes late
[22:10:13] drdee: lets talk about https://mingle.corp.wikimedia.org/projects/analytics/cards/381
[22:10:29] laters yalls
[22:10:31] in meeting, let's schedule something
[22:10:38] k
[22:10:38] drdee: can I show you something ?
[22:10:53] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r38-kraken-logic/pageviews.html
[22:11:24] have to add some code to process tab-logs differently
[22:11:26] drat! the otto!
[22:11:26] and then I'll run again
[22:11:44] i was gonna say that having cluster-wide JVM heap numbers would be great
[22:11:55] but it seems ganglia is only recording the namenode
[22:59:26] average_drifter: you about?
[22:59:34] quick question about your dClass JNI wrapper
[23:00:10] when you instantiate the wrapper and then initialize it:
[23:00:11] dClass = new DclassWrapper();
[23:00:14] dClass.initUA();
[23:00:39] I'm right here
[23:01:00] yeah
[23:01:36] i that an object that can be reused?
[23:01:46] you mean thread-safe ?
[23:01:46] (by re-initializing it with initUA()?)
[23:02:10] no, mostly as a static field.
[23:02:13] if you do initUA again you'll get leaks
[23:02:36] because initUA allocates some memory
[23:02:45] and the pointer to that memory is stored in the class
[23:02:51] and it is de-allocated in the d-tor
[23:02:54] without calling destroyUA(), right?
[23:02:59] yes
[23:04:06] what about calling dClass.classifyUA(ua) more than once?
[23:05:19] ^^ average_drifter
[23:06:08] classifyUA as many times as you want
[23:06:12] no problem
[23:06:15] https://github.com/wikimedia/dClass/blob/package/jni/dclass-wrapper.c#L103
[23:06:21] classifyUA doesn't do any allocations
[23:06:23] any chance of it cross-poluting results with previous calls?
[23:06:27] or freeing
[23:06:31] ahh, thanks
[23:06:41] i looked for the source but it was... unobvious to me
[23:06:50] dschoon: are you sharing a dclass object between threads ?
[23:06:52] i'm def not a C hacker. not since college
[23:06:59] no, i'm contemplating using it statically
[23:07:31] ahh. it makes a new hashmap every time.
[23:07:43] but that's on-heap for the JVM
[23:08:36] the only thing I advise not doing, would be sharing between threads
[23:08:42] apart from that anything goes IMHO
[23:09:13] yeah... i'm not sure how many threads pig runs in a single VM
[23:09:47] when a new wrapper is created, how much work does it do?
[23:10:47] hm. almost nothing?
[23:10:52] dschoon: it loads a 2MB file in memory
[23:10:58] ahhhh
[23:10:59] okay!
[23:11:03] see! that is good to know!
[23:11:05] :)
[23:11:09] dschoon: this file https://github.com/wikimedia/dClass/blob/package/dtrees/openddr.dtree
[23:11:10] because i'm doing that a few billion times!
[23:11:31] can you reuse the object instead of doing it many times?
[23:12:03] so yeah. we'll see.
[23:12:05] that's the plan.
[23:12:17] but right now we're having serious problems with jobs running out of memory
[23:12:33] and strangely, they started RIGHT AROUND when i launched two hourly jobs that involve device classification.
[23:12:34] heh
[23:13:31] can you tell me a bit more about how Apache Pig decides to create new dclass objects ?
[23:13:42] if so we could figure out what's happening
[23:13:45] pig is a dataflow language like R
[23:14:13] you write mapreduce jobs in it, and it builds the low-level mappers and reducers, jars them up, etc
[23:14:32] you can write custom functions, User Defined Functions, for it
[23:14:48] and then you can use them in the high-level language, so you don't have to switch back to java
[23:15:05] so we wrote a UDF that wraps your dClass JNI wrapper
[23:15:19] that wrapper instantiates a new DclassWrapper on every record
[23:15:25] that's not good
[23:15:29] and there's one record for every line of input
[23:15:38] which means a few billion
[23:15:42] and how often does it get freed ? that's up to the JVM right ?
[23:16:08] yes; garbage collection is one of the times the destructor will be called.
[23:16:12] I remember talking to drdee about this
[23:16:23] we saw this problem in the early stages of development of the wrapper
[23:16:26] but i'm certain the framework controls gc closely
[23:16:29] yes.
[23:16:31] so!
[23:16:34] and we discussed that the garbage collector would get us a bit into trouble
[23:16:46] i'm going to try using one static reference to the wrapper
[23:16:58] and re-calling classify for each URL
[23:17:11] ok, please keep me in the loop about this after you do it
[23:17:15] totes.
[23:17:20] thank you
[23:30:55] hm.
[23:31:16] so, the destructor will *never* be called for my static reference.
[23:32:30] yes, but at least you're sure you will have just one object
[23:32:34] is that accurate ?
[23:32:48] since it's an SO, it runs in my process-space, right? so when the JVM dies, the memory should be freed even if i don't call destroyUA()
[23:32:51] yes.
[23:33:26] put another way: the malloc for the dtree should be tagged to my pid, right?
[23:39:54] when the jvm dies, the dtree is freed from memory
[23:39:54] https://github.com/wikimedia/kraken/blob/master/kraken-dclass/src/main/java/org/wikimedia/analytics/dclassjni/DclassWrapper.java
[23:40:07] problem with Java => there are no d-tors
[23:40:12] now I remembered
[23:40:35] we can expose the destroyUA and initUA to Pig
[23:40:50] dschoon: do you think exposing those two directly to the Pig would help?
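
A minimal sketch of the "one static reference" approach discussed above (23:16): a Pig UDF that keeps a single shared DclassWrapper per JVM and calls classifyUA() once per record, instead of constructing a wrapper (and loading the ~2MB openddr.dtree) for every line of input. It assumes the wrapper API named in this conversation (initUA(), classifyUA(), destroyUA()), and it assumes classifyUA() returns the per-call hashmap of device attributes mentioned at 23:07. The class name DeviceClassUdf and the attribute keys (is_tablet, is_wireless_device) are illustrative only, not taken from the kraken code.

    import java.io.IOException;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    import org.wikimedia.analytics.dclassjni.DclassWrapper;

    // Illustrative UDF: classify a user agent string into a coarse device class
    // while loading the dClass dtree only once per JVM.
    public class DeviceClassUdf extends EvalFunc<String> {

        // One wrapper per JVM. initUA() loads the dtree into native memory, so
        // it must run exactly once and never be re-run on the same instance
        // without destroyUA() first (re-running it leaks the old dtree).
        private static DclassWrapper dClass;

        private static synchronized DclassWrapper wrapper() {
            if (dClass == null) {
                dClass = new DclassWrapper();
                dClass.initUA();
            }
            return dClass;
        }

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            String ua = (String) input.get(0);

            // classifyUA() does no native allocation, so calling it once per
            // record against the shared wrapper is fine; the wrapper is not
            // advertised as thread-safe, hence the lock around the call.
            // Assumed return type: the map of device attributes ("hashmap")
            // mentioned in the conversation above.
            Map<String, String> attrs;
            synchronized (DeviceClassUdf.class) {
                attrs = wrapper().classifyUA(ua);
            }
            if (attrs == null) {
                return "desktop";
            }
            if ("true".equals(attrs.get("is_tablet"))) {
                return "tablet";
            }
            if ("true".equals(attrs.get("is_wireless_device"))) {
                return "handheld";
            }
            return "desktop";
        }
    }

Because the reference is static, destroyUA() is never called; but as noted at 23:32, the native dtree allocation belongs to the JVM's process and is reclaimed when the process exits, so the trade-off is one dtree load per JVM rather than one per record. Per the warning at 23:02, initUA() must not be re-run without destroyUA() first, and per the advice at 23:08 the shared instance should not be used concurrently from multiple threads without synchronization.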