[08:04:44] Analytics-Features: Page view tool does not display graphs - https://phabricator.wikimedia.org/T138448#2400667 (jcrespo) > Musikanimal fixed it. It also looks resolved to me. [10:46:52] first vk patch https://gerrit.wikimedia.org/r/#/c/295652/1 [10:47:10] now we can set something like {..., "start-dt":"2016-06-23T10:44:59","end-dt":"-", ..."VSL":"timeout"} [10:49:12] great elukey :) [10:57:39] joal: this one was a sneaky "feature" [10:57:40] :P [10:58:28] I think that we should talk with traffic about increasing the Varnish shmlog buffer or something related, plus increase a bit the vk timeout limit [10:59:26] since text and upload handle way more traffic we'll have the chance to get more timeouts if we don't do anything :( [10:59:31] anyhowwww [10:59:34] * elukey lunch :) [11:05:03] elukey: discussion with traffic seems a good idea :) [11:05:07] enjoy lunch ! [11:50:33] (PS1) Joal: [WIP] Add casssandra bulk loading classes [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295663 [12:24:38] joal: is this what we were discussing yesterday ?? --^ [12:25:06] ah no no [12:25:30] elukey: it actually is :) [12:25:31] I thought you were already ready for the new cassandra loading [12:25:35] realllllyyyy??? [12:25:38] woooooaaaaaaaa [12:25:39] elukey: WIP [12:26:01] \o/ [12:26:23] I managed to get the SSTable computed on hadoop (I think), but no streaming yet [13:08:33] elukey: Seems there is something weird on aqs [13:08:39] elukey: do you have a minute? [13:11:38] sure [13:11:47] aqs new or old? [13:11:52] actually both [13:12:15] When looking at bytes_in in ganglia, I have the feeling bizarre things are happening [13:13:41] mmm [13:13:47] can you give me the link? [13:13:54] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Analytics+Query+Service+eqiad&h=&tab=m&vn=&hide-hf=false&m=bytes_in&sh=1&z=large&hc=3&host_regex=&max_graphs=0&s=by+name [13:16:27] joal: yt?
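The vk patch elukey links above makes timed-out transactions visible in the JSON output via a "VSL":"timeout" field and a "-" end-dt. A minimal sketch of counting such records (the sample lines are invented and heavily trimmed; real varnishkafka records carry many more fields, and a real job would use a JSON parser rather than string matching):

```scala
// Count varnishkafka records flagged with the new VSL timeout marker.
// Sample lines are invented; plain string matching stands in for JSON parsing.
val lines = Seq(
  """{"start-dt":"2016-06-23T10:44:59","end-dt":"-","VSL":"timeout"}""",
  """{"start-dt":"2016-06-23T10:44:59","end-dt":"2016-06-23T10:45:00"}"""
)
val timeouts = lines.count(_.contains("\"VSL\":\"timeout\""))
println(s"$timeouts of ${lines.size} records timed out")
```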
[13:16:32] I am nuria_ [13:17:04] joal: does this look right to pull data from spark [13:17:04] What's up nuria_ ? [13:17:05] hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2016/month=06/day=06/hour=06 [13:17:09] sorry [13:17:13] trying again [13:17:25] https://www.irccloud.com/pastebin/9UrSU1Fs/ [13:17:57] nuria_: various things: how do you launch spark? [13:18:11] and after that, how do you plan to access this data? [13:18:14] joal: spark-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G [13:18:34] I was going to do selects [13:18:34] ok sounds good for spark launch [13:18:46] to get the pageviews and on those look at wmf-last-access [13:18:54] nuria_: ok, so you can use the hive metastore directly [13:19:10] no need to reinstantiate the sql context, it exists [13:20:25] joal: sql context? [13:20:28] so you can go directly to: val df = sqlContext.sql("SELECT * FROM wmf.webrequest WHERE webrequest_source = 'text' and year = 2016 and month = 6 and day = 6 and hour = 6") [13:20:47] line 3 of your pastebin [13:20:51] joal: ah i thought that no longer worked somehow [13:20:53] not needed [13:21:12] nuria_: It didn't work for some time, then went back to working with newer spark versions [13:21:13] joal: cause on the appsession job we access specifying the path [13:21:18] joal: ah ok [13:21:24] nuria_: if you reuse data, don't forget to cache [13:21:38] And, if you use sql, it's actually easier to create a temp table [13:21:39] joal: k [13:22:20] joal: k many thanks [13:23:36] nuria_: https://gist.github.com/jobar/63252ad3327df4d7a3b4a67aff3667c0 [13:23:55] joal: nice thank you [13:24:04] np [13:26:11] sorry joal I was taking a look at the new jobrunners that I've built, apparently they start working asap instead of requiring a pool/depool [13:26:14] :/ [13:35:26] madhuvishy: I was gonna add that tunneling info to Analytics/Cluster/Spark, I tried ssh -N stat1002.eqiad.wmnet -L 8088:stat1002.eqiad.wmnet:8088 and it didn't work and I'm
afraid of ssh tunneling [13:52:31] joal: yes I'm at wikimania [13:52:36] but I'm in your timezone!! what's up [13:52:46] Hi milimetric :) [13:53:00] ok I have a question [13:53:06] this is really weird [13:53:21] Somebody (Jeph) was looking for me at wikimania, so I sent him to you ;) [13:53:26] yes, thanks for that [13:53:32] here's this code: [13:53:35] https://www.irccloud.com/pastebin/FVHK9TxN/ [13:53:37] not sure if I should be thanked ;) [13:53:47] oh no, totally, that's why I'm here [13:53:56] feel free to send infinite people, I'll keep thanking [13:53:59] ok, so that code [13:54:00] That's what I thought as well :) [13:54:03] sure [13:54:08] check out lines 63-65 [13:54:16] if I keep them in there, I get this: [13:54:46] :50: error: erroneous or inaccessible type [13:54:46] case (Some(event), None) => (Some(event), None, None) [13:54:58] and I'm like what?! [13:55:51] without those three lines, it runs fine [13:56:21] here, I will commit a patch to refinery-source so you can look at this if you want (wait are you on vacation?) [13:56:47] milimetric: no vacation yet :) [13:57:01] A patch is the easiest: I'll be able to test [13:57:17] milimetric: Do you use the :paste trick? [13:58:03] paste trick? [13:58:20] (PS1) Milimetric: [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) [13:58:28] ok, that's the patch ^ [13:58:46] I didn't know how to import the Unserialize code from the other file, so it's very hacky [13:58:49] In spark shell, when you want to paste long functions, you can use the :paste command [13:59:05] then you paste a long piece of code, then you CTRL+D [13:59:06] but basically you should be able to copy paste the whole Unserialize file and then the other one is the one that's giving me memory trouble and so on. [13:59:15] And the thing gets interpreted [13:59:26] oh!
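The :paste workflow joal describes can be illustrated with a stripped-down version of the failing pattern match. In spark-shell, type :paste, paste the whole block, then hit Ctrl+D so everything is compiled as one unit; when a class and the code matching on it are interpreted line by line, the REPL wraps each snippet in its own synthetic object, which can surface confusing errors like "erroneous or inaccessible type" (though, as the log shows, :paste did not fix this particular case). The Event class and pair function below are invented stand-ins, not the code from the actual patch:

```scala
// Paste this whole block with :paste, then Ctrl+D, so the case class and the
// match that uses it end up in one compilation unit.
case class Event(pageId: Long)

// A tuple-shaped match like the one in the error message above.
def pair(a: Option[Event], b: Option[Event]): (Option[Event], Option[Event], Option[Event]) =
  (a, b) match {
    case (Some(event), None) => (Some(event), None, None)
    case (None, Some(state)) => (None, Some(state), None)
    case _                   => (None, None, None)
  }
```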
[13:59:28] I can try that [13:59:32] :) [13:59:33] (CR) jenkins-bot: [V: -1] [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) (owner: Milimetric) [13:59:45] I never had problems before but that'd be crazy if that's a problem [13:59:48] It prevents some shell-specific errors [14:00:23] nope, same error [14:00:32] okey [14:06:56] yeah, joal I was basically trying to repartition and this error seems completely weird to me, so I'm starting to learn spark from scratch so I can explain why in the world repartitioning would mess up the compiler [14:15:56] milimetric: first time I see something like that ! [14:16:05] milimetric: will need some more time to investigate [14:18:28] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2402115 (elukey) Finally got the root cause of the VSL timeouts after a chat with Varnish devs. The Varnish workers use a buffer to... [14:31:45] no problem, joal, I appreciate the help and I would say don't bother except I'm kind of stuck on it [14:31:58] jaja [14:36:15] joal: meanwhile I'm looking at http://localhost:8088/cluster/apps/RUNNING and I can't find the job I jut ran [14:36:18] *just [14:36:34] milimetric: I can see it :( [14:36:42] milimetric: refresh? [14:37:03] just Ctrl+R right? [14:37:12] milimetric: hadoop is no socketio, needs refresh ! (f5) [14:37:15] I only see the one that nuria just started and nothing from me [14:37:31] ok! I see it now [14:39:53] yay, so this will eventually run out of memory: http://analytics1001.eqiad.wmnet:8088/proxy/application_1465403073998_43083/ [14:40:28] ooh new error: 16/06/23 14:38:02 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue.
This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. [14:41:29] milimetric: definitely a coalescing issue [14:41:35] milimetric: batcave? [14:41:45] sure, lemme step outside of the crazy hack room and I'll brt [14:41:46] :) [14:41:54] ok ping me when ready [14:43:39] joal: ping [15:08:15] can someone remind me for the thousandth time what the thing is called for queueing tasks up based on the availability of refined data in hadoop? :P [15:10:45] oozie! thats the one! [15:22:33] bd808: did you manage to get any oozie stuff yet to refine the api data? [15:22:59] addshore: well... no [15:23:19] addshore: see https://phabricator.wikimedia.org/T137321 [15:23:43] I have refined data in the bd808 db using some python and shell scripts from cron [15:23:52] but I haven't made it work properly yet [15:24:07] okay, well, im currently looking at https://phabricator.wikimedia.org/T138500#2402280 and using an oozie job to send some numbers to graphite! [15:24:44] milimetric: DAAAAAAAAN !!!!! IT WORKEDDDDDDDD ! [15:25:41] addshore: cool. I think fixing my stuff is a pretty short project. I just haven't spent any time trying to figure out how we normally do it and setting up the proper scripts [15:26:05] I think I'm just going to spend the next hour or so scratching my head while reading oozie stuff! [15:26:29] addshore: An example of a job extracting data and sending it to graphite can be found here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala [15:26:47] addshore: oozie is not the easiest, but we can help if needed [15:26:47] joal: I didn't know there was already one! oh thats awesome..!
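RESTBaseMetrics, linked above, follows a simple pattern: an Oozie-launched Spark job counts matching webrequest rows for one hour and pushes a single value to Graphite's plaintext port. A hedged sketch of that shape (the metric path, host, and SQL filter are invented; only the plaintext line format `<path> <value> <epoch-seconds>` is Graphite's actual protocol):

```scala
import java.io.PrintWriter
import java.net.Socket

// Graphite's plaintext protocol: one "<path> <value> <epoch-seconds>" line.
def formatMetric(path: String, value: Long, epochSeconds: Long): String =
  s"$path $value $epochSeconds"

// Push one formatted line to a carbon endpoint (host/port are placeholders).
def sendToGraphite(host: String, port: Int, line: String): Unit = {
  val socket = new Socket(host, port)
  try new PrintWriter(socket.getOutputStream, true).println(line)
  finally socket.close()
}

// In the Spark job itself (sqlContext is provided by Spark; the filter and
// metric name are illustrative, not the real job's values):
//   val count = sqlContext.sql(
//     "SELECT count(*) FROM wmf.webrequest WHERE uri_path LIKE '/api/rest_v1%' " +
//     "AND webrequest_source = 'text' AND year = 2016 AND month = 6 AND day = 23 AND hour = 14"
//   ).head().getLong(0)
//   sendToGraphite("graphite.example.org", 2003,
//     formatMetric("restbase.requests.sample_count", count, System.currentTimeMillis / 1000))
```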
[15:27:02] addshore: Using spark to make things easier ;) [15:27:16] Then oozie just launches a spark job [15:27:30] addshore: But for this kind of usage, best would be to do streaming [15:27:30] epic [15:27:47] joal: hmm? [15:28:10] addshore: While streaming is really easy enough to setup, we don't have the tools to monitor those type of jobs easily [15:28:31] are there links to docs for this streaming stuff? [15:28:45] addshore: Currently, oozie launches jobs every hour when data is available to extract interesting data and send it to graphite [15:29:21] Best would be to read from kafka and filter / process in almost-real-time then send to graphite [15:29:27] That would be streaming [15:29:38] Analytics: Split opera mini in proxy or turbo mode - https://phabricator.wikimedia.org/T138505#2402369 (Nuria) [15:29:41] are there examples of that already? [15:29:43] For streaming, two ways: Spark-streaming or Flinkl [15:29:51] Flink sorry [15:30:04] addshore: We don't have any of those jobs running, just POCs [15:30:04] My plan was to use something like pageview_hourly and just submit hourly data / aggregate it to daily data [15:30:23] addshore: That has a proven track record of working on our infrastructure :) [15:31:13] milimetric: The full stuff took about 30secs :D [15:31:16] YAY ! [15:31:29] * addshore goes to google spark-streaming and flink [15:32:14] addshore: Those are really more fun than oozie, but as I said, you'd be the first to make it in prod :) [15:32:42] addshore: Not to try to influence, but I really prefer Flink :D [15:32:48] :D [15:35:56] we can't really stream pageview counts to graphite.
statsd will melt [15:36:24] we melted it sending action api counts several months ago [15:36:27] bd808: Another reason it's difficult is that pageviews are not available as a stream in kafka [15:36:42] joal: +1 dragon token [15:36:52] (dragon tokens are won by slaying dragons) [15:37:00] * joal is proud of his new token :) [15:37:48] milimetric: I just had to debug the implicit ordering issue, the rest was really working very well (heavy use of caching, awesomeness !) [15:38:12] do you mind pasting an Ordering[T] example thing? [15:38:25] I'll paste the one I used: [15:38:35] object PageStateOrdering extends Ordering[PageState] { def compare(a:PageState, b:PageState) = a.pageId compare b.pageId [15:38:38] } [15:38:58] ok, sweet, and how do I pass it? [15:39:26] Then, in the fixedPoint function, when assigning the new repartitioned RDDs (p and k) --> val p = potentialStates.repartition(8)(PageStateOrdering) [15:39:30] val k = knownStates.repartition(8)(PageStateOrdering) [15:39:54] aha, but I was right then, no? Repartition(8) just returns a curried function? [15:40:00] But, the weird thing is, it's not working in paste mode, only in regular line-by-line mode [15:40:07] aaahahaha [15:40:10] what?!
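joal's snippet can be rounded out into a self-contained example. The detail behind his answer to milimetric: RDD.repartition is declared as `def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]`, so passing PageStateOrdering explicitly fills the optional implicit parameter list rather than currying into a new function. PageState's fields here are invented for illustration; the real class lives in the refinery-source patch being discussed:

```scala
// Fields assumed for illustration.
case class PageState(pageId: Long, title: String)

// An explicit Ordering to pass into repartition's optional implicit
// second parameter list, or into anything else that sorts PageStates.
object PageStateOrdering extends Ordering[PageState] {
  def compare(a: PageState, b: PageState): Int = a.pageId compare b.pageId
}

// In the Spark job (not runnable without a cluster):
//   val p = potentialStates.repartition(8)(PageStateOrdering)
//   val k = knownStates.repartition(8)(PageStateOrdering)

// The same Ordering works for plain collections:
val sorted = List(PageState(7, "B"), PageState(3, "A")).sorted(PageStateOrdering)
```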
[15:40:12] milimetric: nope [15:40:13] lol, ok, thx anyway [15:40:22] that's good actually, just another dragon for someone else [15:40:32] milimetric: repartition has an optional second parameter [15:40:37] And always returns an RDD [15:40:46] hehehe [15:40:54] cool, makes sense [15:46:27] aharoni: hi, if you sent anything the wifi problems might have eaten it [15:48:47] (PS2) Milimetric: [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) [15:49:26] (CR) jenkins-bot: [V: -1] [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) (owner: Milimetric) [15:50:24] milimetric: https://www.mediawiki.org/wiki/Universal_Language_Selector/Design/Interlanguage_links/metrics [15:50:34] k, looking [15:52:05] joal: sorry, I only just checked the graphs :( [15:52:09] what concerns you? [15:53:31] elukey: I don't understand why aqs1003 and aqs1006 have a lot of inbound traffic [15:53:39] compared to others [15:54:05] milimetric: you here? [15:54:11] yep, hey joal [15:54:36] Quick thing: I am trying to copy your code into my IDE (intellij) [15:54:43] yes [15:54:53] (probably throws errors about unserialize?) [15:54:58] It tells me an interesting thing about the PHP unserialize code [15:55:23] no errors, but an interesting finding [15:55:44] joal: seed nodes? [15:55:47] Line 64 of your file, you compare two values of different types [15:55:57] or something similar [15:56:04] they must have a special function [15:56:06] oh cool, thx, will look after I talk to Amir [15:56:07] hmm, I need more precision around what you mean elukey :) [15:56:39] milimetric: Rest looks ok :) [15:56:50] (for PHP unserialize at least) [15:59:05] joal: no, definitely not seed nodes, they are the ones collecting info about who joins/leaves the cluster.
But there might be some special thing that these hosts are doing on behalf of the cluster [15:59:36] hm ... Looks like a lot of traffic elukey [16:01:44] elukey: You're probably right, the difference has been visible for a long time [16:02:01] joal: standduppp [16:02:09] oops joining ! [16:11:38] joal: you use intellij!! How do you work with the refinery-source repo? I load it in intellij and I don't see any of the directories or packages! [16:15:11] addshore: in standup, will answer after [16:15:18] awesome! [16:19:31] addshore: When importing the project, it worked fine for me :( [16:19:41] hmmmm [16:19:45] addshore: you can also manually add modules [16:19:57] oh hahahaa, I just switched back to the window and after about 10 mins everything has appeared.... [16:20:06] addshore: also, don't forget to enable maven :) [16:20:10] I guess it is just taking its time... [16:20:21] Analytics: Better identify varnish/vcl timeouts on camus - https://phabricator.wikimedia.org/T138511#2402494 (Nuria) [16:20:25] maven possibly (it has to download the entire world) [16:28:55] woo, it has successfully downloaded the whole world... [16:29:41] * joal bows to addshore's maven :) [16:33:01] *installs the scala plugin* [16:57:23] Analytics: Better identify varnish/vcl timeouts on camus - https://phabricator.wikimedia.org/T138511#2402666 (Nuria) We might also want to produce a report for ops regarding timeouts. See also ipv6 ticket: https://phabricator.wikimedia.org/T138396 [16:59:06] (PS3) Milimetric: [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) [16:59:49] (CR) jenkins-bot: [V: -1] [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) (owner: Milimetric) [17:44:26] a-team I'm gone for tonight [17:44:33] Have a good end of day !
[17:56:41] Analytics-Features: Page view tool does not display graphs - https://phabricator.wikimedia.org/T138448#2402836 (Neil_P._Quinn_WMF) Open>Resolved a:Neil_P._Quinn_WMF Yep, seems to be fixed. [20:19:30] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:21:49] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [21:48:07] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#1769033 (Tnegrin) Hi Folks -- this task lines up with work reading is doing with Comms on metrics. I'd like these efforts to be aligned as they are... [21:51:25] Analytics, Reading-analysis, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2403414 (Tnegrin)