[00:18:26] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1210993 (kevinator) @nuria the scope of this task is only to parse the sampled logs for the month of... [00:23:42] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April {crow} - https://phabricator.wikimedia.org/T96067#1211007 (kevinator) [00:38:23] bd808: I'm trying to provision vagrant after enabling the wikimetrics role and it fails with this [00:38:26] https://www.irccloud.com/pastebin/cNbykGlZ [01:30:45] bd808, madhuvishy : seems that it is cloning wikimetrics at "/vagrant/srv/wikimetrics" rather than "/vagrant/wikimetrics" where we would expect [01:46:52] Analytics-EventLogging, Analytics-Kanban, Collaboration-Team, Echo, Patch-For-Review: Echo events not validating on EL - https://phabricator.wikimedia.org/T95169#1211258 (Mattflaschen) Open>Resolved [02:12:21] bd808, nuria: aah. [02:13:09] madhuvishy: hmm... [02:13:20] we moved the code locations around recently [02:13:44] this looks like a file ownership issues from puppet colliding with your file sharing [02:13:51] I think I know how it needs to be fixed [02:14:04] could you file a bug and assign to me [02:14:18] bd808: sure [02:18:18] bd808: filing bug now.. does this have to do with this - http://serverfault.com/questions/487862/vagrant-os-x-host-nfs-share-permissions-error-failed-to-set-owner-to-1000? [02:19:12] madhuvishy: yeah. we have a trick to get away without needing no-root-squash [02:19:28] oh cool [02:26:07] bd808: https://phabricator.wikimedia.org/T96221 First phabricator task I've created :D [02:28:54] madhuvishy: you did it just right. 
thanks [10:02:41] Analytics-Tech-community-metrics, Phabricator, Wikimedia-Hackathon-2015, ECT-April-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1211835 (Qgil) NEW a:Qgil [10:03:03] Analytics-Tech-community-metrics, ECT-April-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1211835 (Qgil) [10:14:20] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1211887 (Qgil) [10:49:40] Analytics-Tech-community-metrics, ECT-April-2015: Provide list of oldest open Gerrit changesets without code review - https://phabricator.wikimedia.org/T94035#1211908 (Dicortazar) Hi, I'd like to confirm some examples: For instance, according to the Korma database and checking the Gerrit site, the two... [13:08:21] Analytics-Tech-community-metrics, Possible-Tech-Projects, Epic, Google-Summer-of-Code-2015, Outreachy-Round-10: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1212036 (Sarvesh.onlyme) Hello, I am having a problem regarding compl... [13:19:59] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April {crow} - https://phabricator.wikimedia.org/T96067#1212057 (JAllemandou) Kevin: here is an example of data (flat file) we would get as a result :https://hue.wikimedia.org/filebrowser/view//user/joal/pageviews_agg... [13:28:32] Ironholds: Heya [14:26:17] holaaa [14:26:25] Hi nuria [14:26:29] Howdy ? [14:28:44] moorrrniin [14:30:08] mforns: we have tasking right? [14:30:18] kevinator: tasking? [14:30:19] nuria, yes I'm in the batcave [14:30:32] yes [14:30:35] ah ok, it was empty just now [14:30:42] Weird ... 
[14:30:43] joseph and I must be in an alternate universe [14:30:52] We are in the batcave with kevinator [14:31:56] nuria, kevinator, joal I think I am in an alternate batcave :] [14:33:07] heey! where are you?! :o [14:38:01] hangouts being silly again? [14:38:30] I've heard of it doing such things [14:45:32] joal i don't like that your job hasn't been started yet [14:45:36] your CONCAT(year ...) [14:45:41] Yup, me neither [14:45:51] oozie stuff, he ? [14:46:14] ? [14:46:28] too many oozie jobs ? [14:46:43] i do have a few extra things running right now, as there are still refined partitions that didn't run [14:46:50] and those are dependencies for some other jobs [14:47:02] I didn't know there was missing partitions [14:47:24] joal, run [14:47:25] refinery-dump-status-webrequest-partitions --datasets webrequest [14:47:40] (I add /srv/deployment/analytics/refinery/bin to my PATH) [14:48:25] Nice ! [14:50:16] I still don't get it why I don't get a share on the default queue :( [14:50:23] yeah me neither [14:50:36] maybe your job is large? [14:50:43] and it knows? [14:50:44] i'm not sure [14:51:19] no, one hour of data [14:54:27] ottomata: question for ya [14:54:30] ottomata: yt? [14:54:34] yup [14:54:36] hiya [14:54:44] ottomata: if we want to test impala [14:55:00] ottomata: should we install it in labs 1st , no puppet, all cruft [14:55:25] ottomata: or should we go for puppetization and install in cluster right away? 
[14:57:25] you can do in labs if you want, my plan was: [14:57:27] test in vagrant [14:57:30] see how it works with yarn [14:57:34] then if all is ok [14:57:39] puppetize, try in labs [14:57:41] then if that is good [14:57:42] install in prod [14:57:46] there are a lot of moving parts though [14:57:50] and i'm worried about resource allocation ow [14:57:56] we already are feeling the pinch [14:58:02] yeah [14:58:03] and impala kinda grabs things for itself [14:58:05] agreed [14:58:31] not true I think about resources: it asks yarn [14:58:36] ottomata: --^ [14:58:38] yes, it asks yarn [14:58:48] but, you were saying about llama and caching [14:58:49] right? [14:58:53] i think it tries to hold onto resources [14:58:57] but we already are just in resources with what we have ... [14:59:29] ottomata: Normally it releases resource in what I recall [14:59:35] hmm, ok [14:59:36] well, we will see :) [14:59:47] joal, I think we need a special queue for oozie launcher. am reading about DRF [14:59:50] https://www.cs.berkeley.edu/~alig/papers/drf.pdf [14:59:50] you know it? 
[14:59:57] nope [15:00:00] sounds interesting [15:00:16] I do agree with dedicated queue [15:01:18] ja, dno't know if DRF is appropriate yet, but neither fifo nor fair really make that much sense...not sure [15:01:43] was thinking maybe we can specify somehow what type or how much resources or share a launcher will need [15:01:45] cause it isn't much [15:03:40] agreed [15:27:57] ottomata: My job started [15:28:39] I think it's because at one moment, the fair share for the default queue went below it's overuse ratio, and started new jobs [15:29:01] yes [15:29:05] makes sense [15:29:11] which is a reason why we should move the oozie jobs out of there too [15:29:12] i think [15:29:23] i dunno, i'm kinda guessing here :/ [15:29:23] Correct [15:29:46] Particularly the other way around: my job shouldn,t block oozie ;) [15:30:59] ottomata: new standup starting [15:32:02] EEK [15:32:31] oof internet being crappy [15:32:36] trying [15:43:18] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1212430 (Milimetric) a:Milimetric [15:51:30] Analytics-Engineering, Analytics-Wikimetrics: "Validate Again" functionality is broken - https://phabricator.wikimedia.org/T78339#1212477 (Milimetric) a:madhuvishy [15:54:02] Analytics-Engineering, Analytics-Kanban, Analytics-Wikimetrics: "Validate Again" functionality is broken - https://phabricator.wikimedia.org/T78339#842830 (Milimetric) [16:01:12] Analytics-Tech-community-metrics, ECT-April-2015: Provide list of oldest open Gerrit changesets without code review - https://phabricator.wikimedia.org/T94035#1212559 (Qgil) Yes, I think we should keep the rules simple and list these as well. Even if these two changesets don't represent the case of a Ger... [16:12:24] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1212614 (Qgil) >>! 
In T94165#1212592, @Ironholds wrote: > Fair. I have no idea what the resourcing for this will be, with Erik's dep... [16:19:12] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1212633 (Ironholds) I'd be happy to help, but the 23rd is straight after I (probably? It's...not entirely clear) shift to working fo... [16:27:53] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1212665 (Qgil) [16:35:03] ottomata2: any news on emails from eventlogging ? [16:39:20] news [16:39:20] ? [16:39:36] I don't receive any email, so I wonder :0 [16:39:38] haven't gotten any since monday [16:40:36] I guess it's in code review, right ? [16:42:28] thx tnegrin [16:42:28] :) thank you [16:42:53] ottomata: got an error from maven on PKIX certificate [16:42:56] first time ... [16:42:59] Any idea ? [16:43:18] joal: naw it is deployed, there haven't been any emails [16:43:26] maven on PKIX cert? [16:43:29] yup [16:43:32] we switched maven to https recently [16:43:34] SSL cert issue [16:43:39] you compiling on your local machine? [16:43:40] RIIIIGHT [16:43:43] Yup [16:43:46] I do sometimes :) [16:44:04] So I should add the cert to my java certl list [16:44:07] mouarf [16:44:53] hm. [16:44:55] i didn' thave any trouble [16:45:23] i just changed my configs to use https rather than http url, but i tested it and it worked even if i didn't do that [16:45:26] since it redirects to https [16:45:51] when you say configs, whAt do you mean ? [16:46:23] settings.xml [16:46:28] ~/.m2/settings.xml [16:46:32] right [16:46:46] hm, we should change that in pom.xml too [16:47:48] for the moment, doesn't work as is for me [16:47:48] on it... 
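[editor's note] The settings.xml fix ottomata describes above can be sketched as a mirror entry pointing at the HTTPS archiva URL. The mirror id and the catch-all `<mirrorOf>` scope are illustrative guesses, and the file is written locally here rather than straight into ~/.m2:

```shell
# Sketch of the ~/.m2/settings.xml change discussed above: point Maven's
# mirror at the HTTPS archiva URL so artifact downloads don't rely on the
# http -> https redirect. Mirror id and <mirrorOf> scope are guesses.
cat > settings.xml <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>wmf-archiva</id>
      <mirrorOf>*</mirrorOf>
      <url>https://archiva.wikimedia.org/repository/mirrored/</url>
    </mirror>
  </mirrors>
</settings>
EOF
# merge into ~/.m2/settings.xml, or use it directly:  mvn -s settings.xml package
grep -c 'https://' settings.xml   # sanity check: the mirror URL is https
```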
[16:47:48] Will do some research [16:47:48] hm [16:47:48] At least I know why now ;) [16:47:52] thx [17:25:20] joal: https://gerrit.wikimedia.org/r/#/c/204548/ [17:33:56] ottomata, hi, could you take a quick look at this - does reducing the number of reducing cause issues? I noticed that in the log reducer percentage kept droping to 0% which is very weird [17:33:56] https://gerrit.wikimedia.org/r/#/c/204153/1/scripts/countrycounts.hql,cm [17:34:18] *reducers [17:35:07] yurik: your one reducer is being preempted :( [17:35:07] https://yarn.wikimedia.org/proxy/application_1424966181866_88626/mapreduce/attempts/job_1424966181866_88626/r/KILLED [17:35:43] yurik: we reduced the number of jobs that could preempt things yesterday [17:36:03] * yurik is googling "preempting reducers" [17:36:04] however, i am currently trying to get some refine jobs to catch up, and I mvoed a few of them into the essential queue about 30 mins ago [17:36:15] essential queue may do aggressive preempting [17:36:40] so, if you are seeing this behavior in your current jobs (these reducers were all preempted in the last 30 mins), this might be why [17:37:07] would it help if i kept the old code? [17:37:27] ottomata, with DISTRIBUTE BY printf('%d-%02d-%02d', ${year}, ${month}, ${day}); [17:37:40] instead of ET mapred.reduce.tasks=1; [17:37:44] *SET [17:38:13] yurik: maybe, but also maybe not, i dont' know. you would have more reducers, and maybe each one would run faster, and therefore any single preemption wouldn't slow you down as much [17:38:20] right now its setting you back by 10-20 mins each time it happens [17:38:36] Analytics, MediaWiki-API-Team, MediaWiki-Authentication-and-authorization: Create dashboard to track key authentication metrics before, during and after AuthManager rollout - https://phabricator.wikimedia.org/T91701#1212974 (Tgr) * Successful logins via Special:UserLogin - `LoginAuthenticateAudit` hoo... 
[17:39:23] ottomata, in the logs i see that not a millisecond was added to either of two jobs for the past half an hour or so [17:39:46] 91% and 24% [17:39:54] ok, makes sense then. i'd say just wait, i'm still trying to get the mess from yesterday cleaned up [17:40:01] joal recommended that you use one reducer, right? [17:40:08] yes [17:40:16] ok, i didn't follow that, but he probably had good reason to :) [17:40:38] well... if it causes much bigger delays... i dunno ) [17:40:48] the idea was to optimize, not the other way around ))) [17:40:57] yurik: let me explain [17:41:19] Using distribute by sends data to reducers based on the distribution key [17:41:34] yurik: it will cause bigger delays if you get preempted, which the cluster is more likely to do right now, because we are fixing things, and your job is lower priority! [17:41:35] oh, so in this case everything gets sent to the same one [17:41:42] yurik: you get it [17:41:58] So yeah, reducers would have finished, but maybe not THE one ;) [17:42:03] yurik: --^ [17:42:44] yurik: makes sense ? [17:42:52] joal, would it help if i moved everything into a subquery, and wrapped it with a "select * from (subquery) distribute by $date" [17:43:05] this way it would use multiple reducers [17:43:08] why do you need to distribute by a key? [17:43:13] can't you just tell it to use more reducers manually? [17:43:16] to have just one file [17:43:29] the result is usually 30 files, most of them empty [17:43:35] I mean, you can't win every side :) [17:43:37] and since i do all that parsing by hand afterall [17:43:43] yurik: how big is the data? [17:43:51] result - 5mb per day [17:43:53] why not just cat the files into one when you are done? [17:44:33] i could, was trying to keep it clean and easy to browse.
if causes a lot of slowdowns, i will simply remove all the distribute by and max reducers [17:44:54] It's for you mainly [17:45:14] right, was just trying to keep the text files more browsable [17:45:33] ottomata, btw, i can't do cat on /mnt - i could load them all though [17:45:43] Having more reducers = less time spent for the job = less chances to get preempted all day long and never finsih [17:45:43] sure you can, just don't write to mnt [17:45:44] or [17:45:44] hdfs dfs -cat path/to/dir/* > onefile [17:45:46] or [17:45:58] hdfs dfs -cat path/to/dir/* | hdfs dfs -put - path/to/onefile [17:46:31] yurik: i'd recommend not using /mnt for anything except for browsing though [17:46:36] is it safe to cat things? if they were all generated by "group by", they wouldn't need to be added further, right? [17:46:37] or (if using compression) hdfs dfs -text /path/to/dir/*.snappy > res_file [17:46:38] its more a convenience than a reliable thing [17:46:59] nah, interesting approaches, but i don't want to touch /mnt [17:47:07] none of those use /mnt [17:47:08] so yeah, i will simply read files from it [17:47:15] hdfs dfs -put [17:47:20] that puts into hdfs [17:47:23] not /mnt/hdfs [17:47:40] doesn't /mnt/hdfs reflect what's in hdfs? 
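[editor's note] The concatenation commands ottomata lists all follow one pattern: stream every part file and write a single output. A local stand-in, with plain files in place of HDFS part files (the file names are made up):

```shell
# Local stand-in for the "hdfs dfs -cat path/to/dir/* > onefile" pattern
# above: many small reducer part files (mostly empty, as yurik observed)
# collapse into one browsable file. File names here are invented.
mkdir -p parts
printf 'us\t100\n' > parts/part-00000
printf 'fr\t42\n'  > parts/part-00001
: > parts/part-00002                 # an empty part file is harmless to cat
cat parts/part-* > counts.tsv        # cluster version: hdfs dfs -cat .../part-* > counts.tsv
wc -l counts.tsv
```

This is also why yurik's "is it safe to cat" worry is unfounded: a GROUP BY sends each key to exactly one reducer, so concatenating part files can never produce a key that needs further merging.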
[17:47:54] yes....but uses a user mounted filesystem thingy that is unreliable [17:48:02] if you use hdfs commands [17:48:04] it doesn't use that [17:48:11] hdfs commands talk to hdfs directly [17:48:33] that's why we mount /mnt/hdfs readonly [17:48:35] i don't trust it :) [17:49:09] It's funny: ottomata doesn't trust /mnt/hdfs, yuri doesn't trust hdfs dfs ;) [17:49:48] yurik: As ottomata says, Hive is usually good at guessing the number of resources you need [17:49:57] hehe, i would be ok with using direct python funcs to call hdfs, but absent of that, i would much rather just read /mnt/hdfs from python as files [17:50:00] So you should trust it, then aggregate the result files [17:50:00] makes things simple [17:50:09] gotcha, will do [17:50:15] thanks for all the explanation! [17:50:24] yuri, i would like to install this...maybe one dayyyy [17:50:25] https://github.com/spotify/snakebite [17:50:36] it is faster and kinda nicer than the default hdfs cli [17:50:41] lovely! yes please :D [17:50:56] ottomata, can we have hql queries as part of it too? [17:51:06] haha [17:51:14] um, i mean, hive has a jdbc connector [17:51:46] is it easy to use from python? [17:51:52] dunno, never tried it [17:52:12] * yurik can't wait for full scale MIT licensed .NET deployment on wiki servers [17:52:21] i want to use proper LINQ :) [18:04:13] meh, one job is stalling, another - 1 second of reduction per minute :) [18:04:15] sigh :) [18:11:31] milimetric: after I've provisioned wikimetrics in vagrant, how do i see if it's up and running? [18:14:37] milimetric: sorry just read README. Figured :) [18:23:02] madhuvishy: btw, we weren't expecting you to just figure these tasks out without any help, I'm happy to hangout and talk about any of them [18:23:21] like, point out pieces of wikimetrics, EL, etc. [18:24:44] milimetric: :) yeah, nuria helped with EL yesterday.
Just looking at Wikimetrics now, will poke with questions in a bit [18:24:51] k [18:38:58] milimetric, do you have 15 minutes today to talk about EL problems? [18:39:10] mforns: definitely, anytime [18:39:42] milimetric, for me it could be now, let me know your preference :] [18:40:23] to the batcave! [18:40:35] xD [18:40:42] joal: if you're interested as well ^ we're batcaving on EL [18:40:51] now ? [18:40:55] milimetric: --^ [18:41:10] joal: yes [18:52:21] o/ joal [18:52:42] hey halfak, in a meeting, will ping you when ready [18:52:46] kk [18:53:06] I'm heading to a meeting soon too. Don't sweat it. :) [19:02:15] nuria: is there documentation on how to use mount to load local folders on mw-vagrant? i think i'm a little confused [19:03:48] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1213391 (Qgil) @Ironhold, then you could dump your thoughts here before moving to the new team, and we will continue from there. :) [19:15:57] halfak: ready whenever [19:16:15] I'm in a meeting, but I'll be bad. [19:16:22] huhub [19:16:45] So, I am considering having my stream processor make external requests (via HTTP) in a mapper. [19:16:50] Good idea / Bad idea ? [19:17:01] hmm [19:17:17] Depends on the back of server answering :) [19:17:31] Basiccaly you parallelize http calls [19:17:40] Can be called DDOS in some countries ;) [19:17:47] Sort of. Most of the work is finding something that *needs* an http call. [19:17:58] :D [19:18:00] So it's going to spend most of the time filtering and a very small amount of time requesting data. [19:18:45] If you need one or two calls per message, and you have 15 messages per second, that's already some good pressure [19:19:06] For non-acknowledge bots, nettiquette is one call per second [19:19:19] for instance [19:20:34] ottomata: Do you think I have an archiva account ? 
[19:21:56] nope, you don't [19:22:05] i can get you the deploy pw though [19:22:12] i want to make that thing work with ldap [19:22:21] there was some bug when i installed it originally, and i never got it to work [19:22:24] i think probably they fixed it now [19:22:59] hmm [19:23:09] still not working with https for me :( [19:23:12] milimetric: Can't login to metric.wmflabs.com - it says it's locked or sth [19:23:14] weird [19:23:20] .org sorry [19:23:21] so ok, joal, let's figure this out [19:23:33] changed the url in the pom [19:24:41] madhuvishy: you mean: https://metrics.wmflabs.org/ ? [19:24:42] i just deleted one of my dependecies [19:24:48] and it redownloaded [19:24:48] Downloading: https://archiva.wikimedia.org/repository/mirrored/org/apache/hadoop/hadoop-client/2.3.0-cdh5.0.2/hadoop-client-2.3.0-cdh5.0.2.pom [19:24:49] or you're trying to log into the instance in labs? [19:24:51] jus tfine [19:25:00] what have I got different? [19:25:03] milimetric: yeah that [19:25:10] Have no idea :( [19:25:17] The correct certificates seem [19:25:21] milimetric: it worked now :/ [19:25:23] do you have a ~/.m2/settings.xml [19:25:23] ? [19:25:28] nope [19:25:30] hm [19:25:32] madhuvishy: you had "metric" and it's "metrics" [19:25:39] ok i will remove mine, but i don't htink thats it... [19:25:49] milimetric: nooooo. i typed it wrong :D [19:26:01] oh :) ok, then blame it on the gremlins [19:26:09] still no problems! [19:26:17] seems that I need to import the certificates ottomata [19:26:24] weird [19:26:49] milimetric: :D okay so i want to see where this validate again functionality is supposed to be, but don't have any cohorts to see it? [19:27:29] javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target [19:28:03] joal, you are os x, right? or liniux? 
[19:28:08] linux [19:28:40] joal I'm worries about http requests per second [19:28:54] yeah halfak I can hear that [19:28:56] I'll only need to make a request 1 per 200k records [19:29:17] depends on the rates of your records [19:29:23] But seems low :) [19:29:24] joal, googline [19:29:26] googling [19:29:28] http://commandlinefanatic.com/cgi-bin/showarticle.cgi?article=art032 [19:29:31] yeah, so do I [19:29:34] I guess I could do this in two passes -- one to filter out the problematic records and another to process just them. [19:29:35] ottomata: --^ [19:29:45] hmm [19:30:00] But I guess my concern is more general about best practices for requesting external resources in a hadoop job [19:30:28] Might be worse to do it in a separate executor : more requests at once [19:30:41] halfak: You can do it, no problem, just be concious ;) [19:33:32] madhuvishy: got your questions answered? [19:33:46] nuria: the mount one, no. [19:34:03] madhuvishy: there are no instructions cause there is nothing to do [19:34:28] madhuvishy: by default your /vagrant folder is mounted on the vm [19:34:35] so anything under it is visible [19:35:35] aaah. so if i enable wikimetrics role in vagrant, and want to develop in local. I pull the project into local /vagrant/srv/wikimetrics and it'll be synced? [19:35:52] rather /vagrant/wikimetrics [19:36:01] nuria: i think that's changed now [19:36:07] the "role" should put your checkout there if bd808 changes worked [19:36:39] madhuvishy: confirm with bd808 what should be right location [19:36:59] nuria: yeah, that change worked fine. I see wikimetrics in /vagrant/srv/wikimetrics [19:37:21] madhuvishy: remove role , remove depot, run vagrant provision [19:37:38] what is depot? 
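[editor's note] halfak's plan (filter almost everything locally, make a rare throttled HTTP call) can be sketched as a stream filter. Everything here is a stand-in: `needs_lookup`, the `NEEDS_HTTP` marker, and using `echo` in place of the real HTTP client:

```shell
# Sketch of the mapper pattern halfak describes: scan the stream, and
# only for the rare record that needs it, make one throttled external
# call (joal's netiquette: ~1 call per second). All names are stand-ins.
needs_lookup() {
  case $1 in *NEEDS_HTTP*) return 0 ;; *) return 1 ;; esac
}
process_stream() {            # $1 = command to run per matching record
  fetch=$1; requests=0
  while IFS= read -r record; do
    if needs_lookup "$record"; then
      [ "$requests" -gt 0 ] && sleep 1   # throttle between external calls
      "$fetch" "$record"                 # a real job would curl an API here
      requests=$((requests + 1))
    fi
  done
  echo "external requests: $requests"
}
printf 'rec1\nrec2 NEEDS_HTTP\nrec3\n' | process_stream echo > demo.out
cat demo.out
```

With one matching record per 200k, as halfak estimates, the throttle almost never fires; it only protects the backing server when matches cluster.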
[19:37:48] madhuvishy: add role again and let's see if it works, that way we are using his changes from scratch [19:37:57] madhuvishy: ay sorry "code repo" [19:38:05] joal: [19:38:07] something like [19:38:12] keytool -printcert -rfc -sslserver archiva.wikimedia.org > /tmp/archiva.wikimedia.org.pem && keytool -importcert -file /tmp/archiva.wikimedia.org.pem [19:38:12] ? [19:38:18] not sure what keystore path shoudl be though [19:38:23] i found one on my mac at [19:38:30] /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre/lib/security/cacerts [19:38:33] ottomata: so we are ok sending jobs to cluster now right? [19:38:38] nuria: ja go ahead [19:38:44] i'm still cleaning stuff up but you should be good [19:38:44] yeah, tried that [19:38:47] will try again [19:38:48] hm [19:38:51] ottomata: --^ [19:38:58] i dunno, imean, i don't know that i will help, i can't reproduce! [19:41:39] ottomata: yeah .. I know [19:41:44] Thanks anyway :) [19:48:32] ottomata: got it to work, didn't add the cert to the right file [19:48:37] Thx again ! [19:49:45] ahhh, cool ok [19:49:46] phew [19:49:49] glad its working [19:49:56] So do I :) [19:50:04] a pain though, but it works ;) [19:54:52] (PS1) Ottomata: Build against cdh5.3.1 packages [analytics/refinery/source] - https://gerrit.wikimedia.org/r/204614 (https://phabricator.wikimedia.org/T93952) [19:54:56] joal: ^ [19:56:34] (CR) Joal: [C: 2] "I'd love to put those as variables :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/204614 (https://phabricator.wikimedia.org/T93952) (owner: Ottomata) [19:57:08] joal: do you know how to do that? [19:57:12] i feel like i've tried that before [19:57:22] also, the only variable you could add is cdh_version [19:57:24] oh yeah, works great [19:57:31] When I don't have cert issues ;) [19:57:33] the actual package version will change with each cdh release [19:57:54] (CR) Ottomata: [C: 2 V: 2] "Ok! If you can do it, then yay!" 
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/204614 (https://phabricator.wikimedia.org/T93952) (owner: Ottomata) [19:59:40] awesoooome, things coming back! [19:59:47] finally got all refined partitions back in place [19:59:56] which immediately launched a bunch of jobs :) [20:00:07] Thansk amilion for that ottomata [20:00:22] yup, so, i haven't restarted anything but the refine and load jobs today [20:00:29] so the other ones are still using the default queue for oozie:launcher [20:00:39] joal and mforns: here are all the gaps I found in Navigation timing: [20:00:43] its kind of annoying to try and restart them while they are running if i want them to just go, so [20:00:44] https://www.irccloud.com/pastebin/XvBHpChL [20:00:57] i'm going to wait, and eithe rrestart them before I leave for the day, or restart them tomorrow [20:01:03] joal: one thing i'm finding really annoying about this setup [20:01:10] is that there are producitno jobs that don't have bundles [20:01:25] hmmm The aggreagation ones ? [20:01:29] but, listing coordinators does not indicate which ones are top level coordinators, meaning that they must be restarted and maintained as coordiantors [20:01:30] ottomata: --^ [20:01:31] milimetric, awesome! [20:01:33] mforns / joal: notice that they happen a lot more in the last month [20:01:44] milimetric, yes... [20:02:00] the number of minutes and exact start times are rough, they could be off by 20 minutes or so [20:02:04] milimetric, and they seem to stop yesterday.. [20:02:08] (because I'm rounding to the nearest 10 minutes) [20:02:12] milimetric, sure [20:02:21] well, or it just hasn't happened yet today :) [20:02:47] Has anything happened around beginning of april ? 
[20:02:47] my script was actually showing some missing today but it's at the end of the day so I don't know if it's replag [20:02:57] yeah, that's when it looks to start going bad [20:02:58] milimetric, did you know that Edit events saw a drop yesterday at 00 [20:03:12] mforns: that makes sense, they are sampling now [20:03:13] joal: [20:03:13] mediacounts/archive [20:03:13] mediacounts/load [20:03:13] mobile_apps/uniques/daily [20:03:13] mobile_apps/uniques/monthly [20:03:14] pagecounts-all-sites/load [20:03:22] milimetric, since then, EL traffic has been much lower [20:03:26] ottomata: when you say no bundle for prod, it's only aggregation jobs, right ? [20:03:34] mforns: yeah, that's great, I should go thank them [20:03:37] Right ottomata [20:03:37] milimetric, oh interesting [20:03:53] We could probably bundle that ;) [20:04:03] joal: i wonder if we should make bundles for these anyway, even if they only have one coordinator? [20:04:06] not sure. [20:04:09] it would make it easier to manage [20:04:22] but, then, more XML :( [20:04:24] milimetric, it may very well be that the problem ceases now (we'll need to fix it anyway, but..) 
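[editor's note] Since joal and ottomata leave the "bundle everything" question open: wrapping a lone coordinator in a bundle is only a few lines of the extra XML ottomata dreads. A sketch, with a placeholder app name, coordinator name, and app-path property rather than the real refinery jobs:

```xml
<!-- Sketch: a bundle wrapping a single coordinator, so the job shows up
     in Hue's bundle list as a top-level production app. Names and the
     app-path property are placeholders, not the real refinery config. -->
<bundle-app name="mediacounts-load-bundle" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="mediacounts-load-coord">
    <app-path>${coordinator_file}</app-path>
  </coordinator>
</bundle-app>
```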
[20:04:28] let's call them aggregation and bundle if you wish [20:04:54] milimetric, but at least we don't loose the data [20:04:56] hmmmm, :) if it stops, we jump up and down and drink some champagne [20:05:00] it is quite nice to go here: [20:05:00] http://localhost:8888/oozie/list_oozie_bundles/ [20:05:03] oops [20:05:04] yeah, we should backfill [20:05:07] milimetric, xD [20:05:09] http://hue.wikimedia.org/oozie/list_oozie_bundles/ [20:05:11] ack [20:05:13] http://hue.wikimedia.org/oozie/list_oozie_bundles/ [20:05:22] https://hue.wikimedia.org/oozie/list_oozie_bundles/ [20:05:24] iuh, [20:05:25] mforns, milimetric make sure to look at sal logs as there were many outages for which there will be no data [20:05:27] *uhm [20:05:31] and see all of the production apps that need managed [20:05:41] sure others could submit bundles, but it is still better than looking in coordinators [20:06:19] iunno [20:06:26] mforns, milimetric like "disk full" events or times when deploy bad code and nothing got written right [20:06:35] nuria, milimetric, I looked at it yesterday and I did not find any incidents happening at the same time as the gaps I found (would have to look to that newly found gaps) [20:07:05] hmm, iiinterseting, joal, logs of unassigned jobs in the oozie queue :/ [20:07:10] mfornsm, milimetric also a service deploy restart (the way we do it) causes at least a couple minutes w/o data [20:07:10] guess that isn't working as is [20:07:29] mforns, milimetric : probably more if i do it as i just go super slow [20:07:52] mforns, milimetric so that is a given with the way we deploy [20:08:04] nuria, aha [20:08:20] ottomata: oozie needs strong preemption :) [20:08:38] Small weight (small number of jobs), but always run them [20:08:43] I think [20:08:43] mforns / joal: here's how I got the numbers: https://gist.github.com/milimetric/b3f3d34d8d6a77f28463 [20:09:20] joal, i wonder if I could just up minResouces? 
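[editor's note] The gist itself isn't reproduced in the log, but the check milimetric describes (round event times down to 10-minute buckets, flag buckets with no events) has a small local analogue; the input here, minutes since midnight, is fabricated, not EventLogging data:

```shell
# Toy analogue of the 10-minute gap check: bucket each event's minute
# down to a multiple of 10, then report buckets with no events at all.
# The event minutes below are invented.
printf '%s\n' 0 7 12 41 47 |
awk '{ seen[int($1 / 10) * 10] = 1; if ($1 > max) max = $1 }
     END { for (b = 0; b <= max; b += 10)
             if (!(b in seen)) print "gap at minute", b }'
```

By construction, anything shorter than a full bucket is invisible, which matches the conversation's decision to treat only chunks well over 10 minutes as real gaps.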
[20:09:21] mforns, milimetric also, note that navigationTiming extension has changed a bunch., perhaps a better table to look at is serversideaccountcreation [20:09:26] • minResources: minimum resources the queue is entitled to, in the form "X mb, Y vcores". For the single-resource fairness policy, the vcores value is ignored. If a queue's minimum share is not satisfied, it will be offered available resources before any other queue under the same parent. Under the single-resource fairness policy, a queue is considered unsatisfied if its memory usage is below its minimum memory share. Under dominant r [20:09:33] as that one hasn't changed since the beginning of time [20:09:37] nuria: the chunks that this script found missing were all pretty big [20:09:40] gonna just try that manually [20:09:43] it rarely found just 10 minutes of missing data [20:09:59] Maybe, but I don't see that as a game changer [20:10:02] but it doesn't look sub-10 minutes so it's fairly safe [20:10:18] milimetric: that's really cool :) [20:10:22] milimetric: did you look in a a table that is NOT that one, cause note , if events do not validate -at all - for a while you will find yourself with gaps [20:10:37] yeah, since it is being starved in its own queue :/ [20:10:37] hm [20:10:48] milimetric: so checking a table that is very stable (like Serversideaccountcreation) is a good check & balance [20:10:55] nuria: the reason I picked that one is because I know it's fairly consistent and small so the query would be quick. It's easy to try any other table in the query above [20:11:14] milimetric: it is not consistent, they changed it recently quite a bit [20:11:14] i'll check SSAC now [20:11:16] milimetric: i had to re-do the alarms [20:11:20] nuria: but they changed the version right? 
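[editor's note] The minResources knob ottomata quotes lives in the fair scheduler's allocation file; a dedicated launcher queue with a guaranteed floor might look like the fragment below. The queue name and sizes are guesses, not the cluster's real allocation:

```xml
<!-- Sketch of a guaranteed-floor queue for oozie launchers in
     fair-scheduler.xml. Queue name and sizes are guesses. Per the doc
     quoted above, a queue below its minResources is offered resources
     before its siblings, so tiny launchers would never be starved. -->
<allocations>
  <queue name="oozie-launcher">
    <minResources>4096 mb, 4 vcores</minResources>
    <weight>0.5</weight>
  </queue>
</allocations>
```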
[20:11:52] milimetric: the schema /the flow of events and what the extension itself captures [20:12:08] ha, joal, one of these launchers says: [20:12:09] milimetric: so that is why i query that table, serversideaccountcreation and whatever other one [20:12:13] TotalMemoryNeeded [20:12:13] 3072 [20:12:13] TotalVCoresNeeded [20:12:13] 1 [20:12:17] that is a running launcher [20:12:30] i guess that is in MB, since it is running in a jvm [20:12:33] seezh [20:12:54] I guress s yes [20:13:00] milimetric: not looking at less than 10 sounds good, that makes total sense [20:13:04] wonder if we can ask for less... [20:13:11] Think to double check : default mapper jvm allocation [20:13:23] ottomata: yes, we can configure that for sure [20:13:28] yeahhh [20:13:28]    SET mapreduce.{map|reduce}.memory.mb=; [20:13:34] with oozie.launcher in front [20:14:22] My guess is that 512m should be enough :) [20:14:28] if not 256 ;) [20:14:31] nuria: so SSAC is almost the same, it showed a few different intervals but roughly the same [20:14:40] milimetric: ok, GOOD [20:14:46] ja 256 should be fine [20:14:48] i think [20:14:53] NT has 159 10 minute chunks and SSAC has 161, on quick look mostly the same timespans [20:14:54] agreed [20:14:55] milimetric: that way we know is a "system" problem [20:15:10] And I still think that preemption is strongly recommended [20:15:13] oh yeah, that was just meant to give an idea, not to be authoritative [20:15:18] milimetric, mforns not related to schema or extension deployment (which we also have plenty of, specially in mobile apps) [20:15:29] yes [20:16:23] nuria, the gaps that milimetric found for that table, do match with the ones I found, that were checked on Edit, NavigationTiming and another one which I do not remember [20:17:00] mforns: I'll paste you the SSAC gaps, they are actually a little bit different now that I look at it closer [20:17:05] nuria, milimetric, so I'm pretty sure that if not all, the majority of the gaps are across all tables 
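[editor's note] The `oozie.launcher.`-prefixed override being discussed would appear as a plain Hadoop property in a workflow's configuration (or, per the later review question, its `<global>` section). A sketch; 256 is only the value floated in the conversation, and whether it survives YARN's minimum container size is the open question:

```xml
<!-- Sketch of the launcher memory cap discussed above, as a property a
     workflow (or its <global> section) could carry. 256 MB is the value
     floated in the conversation, not a verified setting. -->
<property>
  <name>oozie.launcher.mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
```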
and are not related to restarts [20:17:05] especially early this year ( the later ones match ) [20:17:15] milimetric, ok [20:17:51] oh [20:18:17] mforns: ok, GOOD, from 03/22 (leaving feb aside) what was the day we spawned the box again? [20:18:59] the db box [20:19:00] https://www.irccloud.com/pastebin/kbEZA9i4 [20:19:28] mforns / nuria / joal: that's ServerSideAccountCreation: ^^ [20:19:50] same rough pattern, some differences [20:20:44] milimetric, mforns : ya , everything after 03/19 looks related [20:20:54] nuria, Apr 3rd [20:21:52] ok [20:22:34] man... what a problem this one is.... [20:25:01] nuria, I'm trying to tcpdump the outgoing packets of the consumer to see the event timestamps [20:25:19] mforns: we swapped db box on february 26/27 [20:25:52] nuria, oh, I looked for something related to eventlog1001 in SAL and found Apr 3rd, sorry for that. [20:26:41] (PS1) Ottomata: Set memory that oozie:launcher map task takes to 256MB [analytics/refinery] - https://gerrit.wikimedia.org/r/204621 [20:26:51] joal: ^ i haven't tried that yet [20:26:56] i will tomorrow [20:28:01] mforns, milimetric and for those gaps of data, the data is IN the logs (easy to check for serversideaccountcreation as server logs are smaller) [20:28:35] yea, true. we should be able to fairly easily backfill as this doesn't even need validation [20:28:44] milimetric, nuria, aha [20:31:19] POLO time laters! [20:31:26] mforns, milimetric ok, 1st thing would be to check logs and see those events are there for given intervals [20:33:22] nuria, I've already checked that for some (not all) intervals, it seems all events are there, validated by the processor [20:33:38] mforns: k, good [20:33:54] nuria, you think we should check all intervals? [20:33:54] mforns: this is... SO.... puzzling....!!! [20:34:01] mforns: nah [20:35:26] (CR) Joal: "For configuration facility, values should be passed as parameters instead of hard-coded. 
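(The backfill idea milimetric describes — events for the gap intervals are still in the server logs and have already been validated, so they only need to be replayed — can be sketched like this. The function and data shapes are hypothetical; the real backfill would read the sampled server log files:)

```python
from datetime import datetime

def events_to_backfill(gaps, log_events):
    """Keep only log events whose timestamp falls inside a known gap.

    gaps:       list of (start, end) datetime pairs found by the gap scan
    log_events: list of (timestamp, raw_line) pairs parsed from server logs
    """
    return [(ts, line) for ts, line in log_events
            if any(start < ts < end for start, end in gaps)]

gaps = [(datetime(2015, 3, 22, 10, 0), datetime(2015, 3, 22, 10, 30))]
log = [(datetime(2015, 3, 22, 9, 59), 'before gap'),
       (datetime(2015, 3, 22, 10, 15), 'inside gap'),
       (datetime(2015, 3, 22, 10, 45), 'after gap')]
print(events_to_backfill(gaps, log))
```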
Why not have them in the coordinators or bundle" [analytics/refinery] - https://gerrit.wikimedia.org/r/204621 (owner: Ottomata) [20:37:05] madhuvishy: did you provision with role? is the code checked out? [20:37:16] nuria: yes [20:37:29] madhuvishy: in what location? [20:37:51] it's at /vagrant/srv/wikimetrics [20:39:44] madhuvishy: and what do logs at /var/log/upstart say? [20:41:16] (CR) Nuria: "Can't this be a "global" oozie property?" [analytics/refinery] - https://gerrit.wikimedia.org/r/204621 (owner: Ottomata) [20:41:46] https://www.irccloud.com/pastebin/5KF7yLDD [20:42:50] nuria: this is in wikimetrics-web.log [20:43:18] madhuvishy: did you look at wikimetrics README? [20:43:35] the other two logs say the same thing too. yeah i did.. [20:43:35] madhuvishy: seems that we need to build the app right? [20:44:09] madhuvishy: as really at wmf we do not have an env that "knows" how to install python [20:45:27] PHP should be enough for everyone, right? [20:46:29] nuria: right. does setup.py build it? [20:46:40] YuviPanda: if i do not write another line of php in my life i will be happy [20:46:54] madhuvishy: i think >python setup.py install [20:47:03] nuria: :D Me too. I think the world might be a better place if nobody had to, but oh well :) [20:47:03] nuria: i did that too [20:47:04] madhuvishy: (we will need to update README after all this) [20:47:27] madhuvishy: and what was the output , cause packages should be built then [20:47:38] nuria: all good [20:47:45] madhuvishy: and pip install? [20:48:02] nuria: all reqts satisfied [20:48:34] so you have: "/usr/local/bin/wikimetrics" but it doesn't find the wikimetrics egg, is that so? [20:50:05] joal: the x_analytics_map['uuid'] should be populated now right? or does it have issues like the timestamp?
[20:50:34] should definitely be populated [20:50:38] nuria: --^ [20:50:52] nuria: I used the map for another request, but not uuid per se [20:51:54] madhuvishy: so you have: "/usr/local/bin/wikimetrics" but it doesn't find the wikimetrics egg, is that so? [20:52:08] nuria: we're in the batcave trying to debug, wanna join? [20:53:54] milimetric: i think you got it then, will go back to mobile sessions [20:55:49] joal: and this query looks ok , right? " [20:55:50] select x_analytics_map['uuid'] from webrequest where year=2015 and month=04 and day=12 and hour=01 and x_analytics_map['uuid'] is not NULL limit 100; [20:56:41] nuria: had to provision again! [20:57:26] madhuvishy: ok, and now you can access http://localhost:5000? [20:57:33] nuria: Yeah [20:57:45] madhuvishy: ok, ta-tachannnnnn [20:58:04] madhuvishy: take a look at wikimetrics README, we should update it with any pertinent new info [20:58:18] madhuvishy: https://github.com/wikimedia/analytics-wikimetrics/blob/master/README.md [21:00:22] nuria: yes, query looks good [21:00:26] Having issues ? [21:07:45] joal: maybe there are too few values, then, that's fine no worries [21:08:44] hmmm [21:13:14] joal: will try to run it for the couple weeks of data we have for april [21:13:26] ok [21:13:36] Looking as well : seems not to have data [21:13:55] select x_analytics_map, x_analytics from wmf.webrequest where webrequest_source = 'mobile' and year=2015 and month=04 and day=12 and hour=19 and lower(x_analytics) LIKE ' %uuid%' limit 100; [21:14:12] Double checking for sure [21:14:16] With [21:14:26] the timestamp stuff, I'm afraid now ;) [21:14:48] nuria: --^ [21:16:47] joal: ya... [21:16:53] joal: seeing same thing [21:17:00] weird, huh ? [21:18:26] joal: but no worries, it is not the map [21:18:36] Well, worries still [21:20:19] joal: for mobile team yeah, will run query for all april [21:20:29] k [21:22:39] hey milimetric: got a sec? [21:24:38] hi joal. 
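(The `x_analytics_map` column queried above is built by splitting the raw X-Analytics header, which is a semicolon-separated list of key=value pairs. A rough Python equivalent of that parsing — the production version is a Hive UDF in refinery-source, so this is only a sketch with made-up sample values:)

```python
def parse_x_analytics(header):
    """Split an X-Analytics header like 'uuid=abc;https=1'
    into a dict; fragments without '=' are skipped."""
    result = {}
    for part in header.split(';'):
        if '=' in part:
            key, _, value = part.partition('=')
            result[key.strip()] = value.strip()
    return result

print(parse_x_analytics('uuid=0123abc;mf-m=a;https=1'))
```

(A query returning no rows for `x_analytics_map['uuid']`, as joal and nuria see, then means the `uuid=` pair was simply absent from the header for those requests, not that the map itself is broken.)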
I'm wondering if the cluster is in good enough shape that we can start running jobs on it [21:25:07] Hey leila: Cluster is good now, you can go and run things :) [21:25:18] great. thank you! :-) [21:25:24] joal, ^ [21:25:31] no problem [21:26:15] nuria: time for me to go to bed [21:26:28] Good luck with uuids ;) [21:26:36] See y'all tomorrow ! [21:26:38] joal: ciao [21:30:39] nuria, milimetric, using tcpdump I discovered that the problem is actually in eventlogging side not db [21:31:22] the inserts that I caught outgoing now with destination m4-master date from 1:30h in the past [21:31:30] ahajammmmm [21:31:34] dying to KNOW [21:32:23] mforns: let me guess SQLA has some buffer with asynchronous mechanism ;) [21:32:35] nuria, milimetric, so, given that the consumer logs do not reflect this lag, I guess the problem may be in sqlalchemy [21:32:45] joal|night, that's what I was going to say [21:32:51] :D [21:32:58] but that's still a theory, going to research on that [21:33:17] Yeah, I guess so [21:33:27] Good luck, and bravo for the finding ! [21:33:42] * joal|night bows to mforns [21:33:54] xD, good night! [21:35:17] mforns: uy, uy this is deep in the bowels of the code now, man, [21:36:06] nuria, googling [21:37:28] mforns: BTW, simpledateformat is not thread safe in java so a new instance needs to be created per usage [21:37:46] https://www.irccloud.com/pastebin/fU012W9w [21:38:12] nuria, oh! [21:40:47] mforns: just an FYI, for CR [21:41:11] nuria, ok [21:47:51] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1209816 (DarTar) Assigning this to @ironholds, as discussed during our 1:1 (thanks, O!) 
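(The key measurement behind mforns's tcpdump finding is the difference between an event's own capture timestamp, visible in the INSERT payload on the wire, and the wall-clock time the packet is observed. A trivial sketch of that lag arithmetic, with made-up times matching the ~1:30h figure from the chat:)

```python
from datetime import datetime

def consumer_lag(event_ts, observed_at):
    """Lag between when an event was captured (its payload
    timestamp) and when its INSERT was seen leaving the consumer."""
    return observed_at - event_ts

# An event stamped 20:00 whose INSERT only appears at 21:30
# is running an hour and a half behind.
lag = consumer_lag(datetime(2015, 4, 13, 20, 0),
                   datetime(2015, 4, 13, 21, 30))
print(lag)
```

(That the consumer's own logs did not show this lag is why suspicion falls on buffering below the logging layer, i.e. in the SQLAlchemy insert path.)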
[21:48:04] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1213879 (DarTar) a:Ironholds [22:03:44] (PS9) Nuria: [WIP] Add Apps session metrics job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns) [22:04:50] too bad that joal|night is asleep cause he'd like to know that lowering parallelization of the session job lowered its run time to 6 minutes for 1 day of data cc mforns [22:05:23] nuria, :O [22:06:17] mforns: ya, i know, super-good! we could run a day in a couple hours , wow, that wasn't me i just did the testing [22:06:55] mforns: and removed some unnecessary additional computations , but basically it's the code you wrote 1st [22:06:56] nuria, you mean run a *month* in a couple hours, right? [22:07:07] mforns: sorry a MONTH yes [22:07:34] nuria, to lower parallelization means fewer partitions? [22:08:17] mforns: yes, [22:08:20] ok [22:08:32] mforns: so you know how i was saying the apps dataset is tiny [22:08:37] aha [22:08:49] nuria, makes sense [22:08:52] mforns: but mobile dataset is big, so we filter, filter [22:09:43] mforns: and once we have the data we lower the number of partitions 1 order of magnitude, I will do another test but it sure seemed to make a difference [22:09:54] from ~20 mins to 6 mins [22:09:59] seems too big to be a fluke [22:10:05] mforns: but i will retry [22:10:12] nuria, ok fine [22:46:56] does anyone know if webrequest.agent_type will identify apps ? [22:51:12] halfak: can we chat in the batcave? [22:51:59] kevinator, I'm in a meeting now. 
Done in 10 minutes [22:52:13] ok it can wait [22:56:12] nuria, I think I found it [22:57:23] milimetric, nuria, I think it is actually a problem in the consumer's sql_writer method [22:57:48] the 'events' list is getting too big eventually [22:58:06] halfak: any final comment on the QR report for revscoring, LMK (I can make minor changes until tonight) [23:00:01] Looks good to me. [23:00:04] Well... [23:00:14] We have made substantial progress towards the revision coder. [23:00:22] But that's not demo-able quite yet. [23:00:37] Do you think we should note the adoption of OOjs UI? [23:00:58] DarTar, ^ [23:01:07] na – to be clear, we’re not stating that the project was meant to end by Q3 (the timeline is the one captured on the IEG page) [23:01:17] but we may want to update the ETA column [23:01:51] up to you, if you want to make any changes to the stand-alone deck that I sent leave them in clear so I can see them and port them [23:01:52] The prototype revscoring service was online a couple months ago, so that's good. [23:02:01] yup [23:02:09] No worries. I think this looks great. [23:02:14] cool [23:04:18] o/ kevinator just hopping into the batcave now [23:04:40] one sec [23:04:42] brt [23:05:06] Analytics-Cluster, Analytics-Kanban: {epic} Analyst runs query to get aggregated pageview counts {crow} - https://phabricator.wikimedia.org/T96314#1214143 (kevinator) NEW [23:06:38] mforns: aham [23:07:04] mforns: despite it being emptied every 1.5 sec of 400 items? [23:07:12] nuria, yes [23:08:02] and how is it happening? cause influx is about 300 per sec (maybe it went higher) [23:08:15] nuria, it seems the events list is being added more elements than popped [23:08:36] nuria, don't know [23:09:04] nuria, the thing is, the problem went past the logs in an unhappy way.. 
[23:09:18] mforns: ah yes, that is the important part [23:09:23] nuria, I'd say we should change the logs [23:09:34] mforns: cause code will have bugs but we need issues to be logged there [23:09:44] mforns: totally, we had no logs whatsoever before [23:09:53] mforns: what should we change [23:10:24] nuria, we should log the last event timestamp on batchEvents, not events [23:10:33] and maybe log the size of 'events' [23:11:43] nuria, and also.. I suppose that the program crashed by being killed by the system for high memory consumption [23:11:44] right? [23:11:47] mforns: please do add that [23:11:51] mforns: sounds good [23:12:04] mforns: high memory? no i doubt that [23:12:17] mforns: who would kill it? [23:12:22] nuria, then how do events get lost? [23:12:35] mforns: well, that is the 1 million dollar question [23:12:40] nuria, the system [23:13:07] when you have a program that is starving the OS, the OS kills it, no? [23:13:31] because 2 hours of events is like 1GB [23:13:46] mforns: but how are those not being batched in 400 items [23:14:00] it could be growing (kind of like a memory leak if you will) [23:14:54] nuria, 'events' keeps getting big until 400, at that point it flushes to 'batchedEvents' [23:15:21] mforns: yes and 400 events are inserted in <1.5 secs [23:15:31] nuria, while the child processes that, 'events' continues to get bigger [23:16:00] mforns: yes, but influx is not unlimited, we have been running at 400 events per sec [23:16:26] mforns: let's log size of queue [23:16:31] nuria, 'events' may get to 400 before the child finishes the processing of the first 400 [23:17:37] nuria, at that time the parent flushes them into 'batchedEvents' once more, and sets ready=true [23:17:49] nuria, but the child is still working on the first 400 [23:18:04] mforns: yes but that is true if the influx is bigger than what it takes to store 400 [23:18:24] mforns: if you look at logs you will see how long it takes to insert 400 [23:18:26] nuria, yes, it 
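(The race mforns describes — the parent refilling 'events' to 400 faster than the child can insert a batch — comes down to simple rate arithmetic. A toy model, using the figures from the chat (300-400 events/sec inflow, batches of 400, ~2s per insert); this is not the consumer's actual code, just the back-of-envelope check:)

```python
def backlog_after(seconds, inflow_per_sec, batch_size, insert_secs):
    """Toy model of the consumer: events arrive at a constant rate;
    the child inserts one batch of `batch_size` every `insert_secs`.
    Returns how many events are still waiting after `seconds`."""
    arrived = inflow_per_sec * seconds
    inserted = (seconds // insert_secs) * batch_size
    return max(0, arrived - inserted)

# Filling 400 events at 300/s takes ~1.33s. If each insert takes 2s,
# the child drains only 200 events/s and the backlog keeps growing;
# at 1s per insert it keeps up.
print(backlog_after(60, 300, 400, 2))
print(backlog_after(60, 300, 400, 1))
```

(This matches the conclusion nuria reaches below: once a 400-event insert takes over ~1.3s at that inflow, the queue can only grow, which is why the insert time itself looks like the db-side problem.)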
seems that that's what is happening [23:18:34] is about 1.2 secs [23:18:46] nuria, mmm [23:18:48] mforns: but we do not run at 400 events/sec at all times [23:19:54] nuria, this is a histogram of run times: http://pastebin.com/v5MiJUfn [23:20:27] nuria, I think most batch inserts take around 2 secs now [23:20:38] mforns: man too bad graphite data is not there, but now we are running at 300 per sec [23:20:41] nuria, and some of them take quite a bit longer [23:21:31] mforns: but then that is a db problem , cause 400 events are 10 schemas at most and that should be real fast [23:21:42] nuria, aha [23:21:57] mforns: they are entered per schema [23:22:13] so 10 inserts? [23:22:33] and -need to check- but we didn't use to have more than 10 schemas with inflow of more than 1 per sec [23:22:41] mforns: say 20 at most [23:22:56] mforns: we can check it [23:22:59] ok [23:23:41] nuria, well to be completely sure of that, I will push a change with the logs on event size and correct last event timestamp [23:23:52] we can deploy it tomorrow [23:24:11] mforns: i think your description is fine [23:24:37] mforns: sounds like a likely cause (if you see in tcpdump events of 1 hour ago) [23:24:45] yes [23:25:07] mforns: but it is a db issue that 400 events take >2 secs [23:25:18] mforns: as when i tested this size initially it was ~1 [23:25:30] nuria, yes you're right [23:27:00] nuria, maybe increasing the batchSize, it would be lighter for the db? [23:28:07] mforns: i reduced it for that reason, when i 1st deployed i had it at 1000 [23:28:16] nuria, oh... [23:28:22] mforns: mannnn.... [23:28:51] ehem hadoop.. [23:29:00] kevinator: when you’re done talking to aaron, I have a quick question for you regarding AnEng reqs in this fiscal [23:29:19] I’m sitting behind the paper wall couch [23:32:48] mforns: you can try changing it but see: https://gerrit.wikimedia.org/r/#/c/197070/3/server/eventlogging/handlers.py [23:33:01] DarTar: where are you? 
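("They are entered per schema" means a 400-event batch collapses into one bulk insert per distinct schema, ~10-20 statements at most. A toy stand-in for that grouping, using sqlite in place of the real m4-master MySQL and ignoring the schema-revision suffix the real table names carry:)

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Hypothetical mixed-schema batch: (schema_name, json_payload) pairs.
events = [('NavigationTiming', '{"a":1}'),
          ('ServerSideAccountCreation', '{"b":2}'),
          ('NavigationTiming', '{"a":3}')]

db = sqlite3.connect(':memory:')
insert_count = 0
# Sort then group so each schema yields exactly one executemany() call.
for schema, group in groupby(sorted(events, key=itemgetter(0)),
                             key=itemgetter(0)):
    rows = [(payload,) for _, payload in group]
    db.execute('CREATE TABLE IF NOT EXISTS "%s" (event TEXT)' % schema)
    db.executemany('INSERT INTO "%s" (event) VALUES (?)' % schema, rows)
    insert_count += 1

print(insert_count)  # 2 distinct schemas -> 2 bulk inserts
```

(With so few statements per batch, >2s per 400-event batch does point at the database side rather than at the batching itself, which is nuria's argument here.)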
[23:33:18] kevinator right next to you, paper wall [23:33:22] mforns: https://wikitech.wikimedia.org/wiki/EventLogging#Benchmarking_DB_inserts [23:33:44] nuria, no no, I trust you, I was saying we should switch EL to hadoop [23:33:55] mforns: right right [23:34:15] mforns: my comment was towards "increasing" batch size, sounds like 1) inserts take too long [23:34:30] nuria, aha [23:34:30] 2) decreasing batch size might help [23:34:40] ok [23:34:48] nuria, we can try that [23:34:59] mforns: but it is getting late for ya, we can talk about this tomorrow [23:35:10] nuria, yes ok [23:35:25] thanks for the help! [23:35:33] good night!