[00:00:03] <Ironholds>	 do you remember the queue that the job you killed was in?
[00:00:05] <yurik>	 i have no problem with getting them reviewed obviously
[00:00:33] <yurik>	 Ironholds, the only information i could get about that job came from mapred - i emailed about it to aottomata, and posted it here
[00:00:38] <Ironholds>	 gotcha
[00:00:42] <yurik>	 job_1424966181866_83018	RUNNING	1428880704022	hdfs	root.essential	NORMAL	102	1	313344	36864	350208	http://analytics1001.eqiad.wmnet:8088/proxy/application_1424966181866_83018/	488448
[00:00:49] <Ironholds>	 root.essential
[00:00:55] <Ironholds>	 you killed a hdfs-administered, root.essential job.
[00:01:30] <yurik>	 Ironholds, i tried killing my own jobs first - it had 0 effect
[00:01:34] <Ironholds>	 https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load#What_to_do_if_the_cluster_is_stalling.3F says quite clearly that your options are (1) kill your own jobs, emphasis yours, or (2) /ask/ someone else who is running an intensive task to kill it or (3) add more hardware which I suspect might involve a bit of paperwork
[00:01:45] <yurik>	 they are all root.essential in that list
[00:01:51] <Ironholds>	 instead, you killed someone else's job, without asking first. Not only did you kill someone else's job, it was a hdfs-administered root.essential job.
[00:02:09] <Ironholds>	 You realise this was quite possibly an ETL or data consumption task?
[00:02:55] <Ironholds>	 please do not ever, ever do that again. You don't kill someone else's job without asking, you definitely don't kill a root.essential job.
[00:03:46] <Ironholds>	 Now, I'm going to go email analytics-internal and tell them that ETL or data consumption is quite possibly behind and they should do that, and also remind the public list (not mentioning, like, names or what happened) that other peoples' jobs are verboten unless your name is nuria or otto
[00:04:43] <yurik>	 Ironholds, my appologies, but here was my reasoning: i see 14 jobs, they don't change in 20 min, i try killing my own jobs and nothing happens
[00:04:59] <yurik>	 thus if i don't do something, the whole server is at a standtsill, helping noone
[00:05:11] <Ironholds>	 Well, I'm looking at your jobs
[00:05:24] <Ironholds>	 and they're both still running and they're both using the vast, vast majority of the memory in use
[00:05:33] <yurik>	 those are not the same jobs
[00:05:59] <yurik>	 there is a script that runs them one by one
[00:06:08] <yurik>	 i run the script, it executes them
[00:06:17] <Ironholds>	 future design pattern, though: when you kill jobs and it has no effect, the question you ask is "why did this not have any effect?" not "which job do I get rid of next?"
[00:07:31] <yurik>	 sure, please document https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load -- i looked, posted a comment here, noone replied for half an hour
[00:08:03] <yurik>	 thus if i continued doing nothing, i figured it would cause much bigger issues than killing one job (note that all of them had root.essential)
[00:08:04] <Ironholds>	 on a sunday
[00:08:18] <Ironholds>	 and yes, I suspect all of them have root.essential because most or all of them are ETL or consumption tasks.
[00:08:47] <yurik>	 but my point is that none of them could go forward, thus one job vs all
[00:09:01] <yurik>	 again - i am happy to go by the book - i looked at all the available docs
[00:09:02] <yurik>	 and the graphs
[00:09:03] <Ironholds>	 Sure, so there is a contact list of phone numbers, and there are several analytics mailing lists
[00:09:13] <yurik>	 there is a list of phone numbers?
[00:09:14] <yurik>	 where?
[00:09:25] <Ironholds>	 the contact list on the office.wiki, but the mailing lists are probably a better idea
[00:09:43] <yurik>	 Ironholds, please take a look:  https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load#What_to_do_if_the_cluster_is_stalling.3F
[00:09:49] <Ironholds>	 yes, I did
[00:09:53] <Ironholds>	 (1) kill /your own/ jobs
[00:10:05] <Ironholds>	 (2) /ask/ someone else for /them/ to kill their jobs.
[00:10:08] <Ironholds>	 (3) buy more hardware
[00:14:39] <yurik>	 again, i'm very sorry if i caused any problems. I tried to resolve the issue when i saw it, and when i didn't see anyone on the channel - figuring that having a stalled cluster for all jobs is much worse than killing one (that could probably be re-ran anyway). I looked at mem consumption (assuming rsrvMem is the right column), and i saw that my job was taking significantly less memory than others:  350,208   227,328   159,744   36,864   36,864
[00:15:05] <yurik>	 thus without any other guidance, i decided it was the best course of action
[00:16:16] <yurik>	 i probably should have killed my own two jobs first, just to be sure, but simply looking at the list produced by mapred, they seemed like very minor
[00:16:36] <Ironholds>	 yeah, I have no idea on that front; didn't see it until you killed them
[00:16:43] <Ironholds>	 and I get that you were trying to do the right thing
[00:16:51] <yurik>	 i have a copy of the mapred output at different times
[00:16:58] <Ironholds>	 (this is appreciated. People caring about the clusteris always appreciated)
[00:17:33] <Ironholds>	 but, those guidelines exist for a reason, and one of them is that if a regular job is killed, AnEng has to clean up after it and the cluster then has to play catchup, which slows things down for a longer period of time and potentially slows data availability
[00:18:05] <yurik>	 i understand, and again - i tried to do the "right thing" (tm)
[00:18:07] <Ironholds>	 FWIW, when you get your job in, give me a poke; I'd be happy to take a look and see if I can think of any ways to reduce memory consumption or generally help out. And if not, let's poke AnEng and find out what the hardware purchases for next year look like. A big cluster is a happy cluster :)
[00:18:13] <Ironholds>	 yeah, totally; like I said, the intentions are appreciated.
[00:18:53] <yurik>	 please notice that i have also tried to clean up some of the cluster docs - but more work is obviously needed
[00:19:00] <yurik>	 sure, will do in a sec
[00:19:27] <Ironholds>	 great :)
[00:19:35] <Ironholds>	 I will add hadoop job -list to the docs, too
[00:21:26] <grrrit-wm>	 (PS1) Yurik: countrycounts fixups - don't overwrite if exists [analytics/zero-sms] - https://gerrit.wikimedia.org/r/203780
[00:21:55] <grrrit-wm>	 (CR) Yurik: [C: 2 V: 2] countrycounts fixups - don't overwrite if exists [analytics/zero-sms] - https://gerrit.wikimedia.org/r/203780 (owner: Yurik)
[00:22:42] <yurik>	 Ironholds, https://git.wikimedia.org/blob/analytics%2Fzero-sms/HEAD/scripts%2Fcountrycounts.hql
[00:22:49] <Ironholds>	 awesome; thanks :)
[00:23:58] <Ironholds>	 so, are you looking specifically for pageviews? Looking at https://git.wikimedia.org/blob/analytics%2Fzero-sms/HEAD/scripts%2Fcountrycounts.hql#L53
[00:24:45] <yurik>	 no, to the total bandwidth + number of requests + pageviews
[00:25:13] <yurik>	 couldn't figure out an easy way not to dup the condition
[00:25:18] <Ironholds>	 ahhh
[00:25:29] <Ironholds>	 yeah, that sounds...difficult
[00:25:48] <yurik>	 https://phabricator.wikimedia.org/T95836
[00:25:53] <yurik>	 working around that issue
[00:26:01] <Ironholds>	 *nods*
[00:26:15] <Ironholds>	 I should probably wander for the evening so I can finish a software engineering analysis I've been working on for my community
[00:26:24] <Ironholds>	 but I'm happy to check it all out Monday if it'd be helpful?
[00:27:55] <yurik>	 sure, thx
[00:28:17] <yurik>	 have to go as well, sorry for the trouble. We should probably have a special !alert command or something here
[00:28:33] <yurik>	 to ping all people who might decide how to save a cluster
[00:28:40] <Ironholds>	 np!
[00:28:42] <Ironholds>	 and agreed
[00:28:45] <Ironholds>	 stalkwords would make sense
[00:31:55] <jgage>	 hi, hadoop datanode analytics1017 crashed and i've just rebooted it. i'm upgrading all packages and will do a second reboot in a couple minutes to update the kernel.
[00:38:43] <Ironholds>	 jgage, thanks! send a note to the mailing list? IRC on a sunday...
[00:39:50] <jgage>	 sure. first i'm going to try to figure out what's up with this bigtop-jsvc package which was downgraded when i upgraded everything else..
[00:40:21] <Ironholds>	 heh; fair ;p
[00:42:36] <jgage>	 looks like it was actually an upgrade; confusing version strings. cdh 4.3.1 for precise came with "1.0.10" but cdh 5.3.1 came with "0.6.0". looks like the right version is installed now.
[00:42:55] <jgage>	 1.0.10-1.cdh4.3.1.p0.71~precise-cdh4.3.1 vs 0.6.0+cdh5.3.1+615-1.cdh5.3.1.p0.17~trusty-cdh5.3.1 (gesundheit)
[00:43:07] <Ironholds>	 oh, CDH
[00:43:34] <Ironholds>	 "we've got a Thrift ODC client! But only for version <n or >N and nothing in between and we're deprecating it with >X
[00:43:43] <Ironholds>	 I sort of just smile and nod at them these days
[01:48:27] <nuria>	 jgage: where you able to resurrect the cluster?
[02:51:20] <wikibugs>	 Analytics: Junk in wmf.webrequest.uri_host field - https://phabricator.wikimedia.org/T95836#1202076 (Yurik)
[03:25:11] <yurik>	 jgage, hi, i'm looking at my query, and it has been getting about 4 seconds per minute of cluster time. Is it alvie?
[03:25:30] <yurik>	 now it has stopped completely (
[03:26:38] <yurik>	 lol, i guess it only works when i whine here )))
[04:11:48] <wikibugs>	 Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1202198 (kevinator)
[07:06:21] <wikibugs>	 Analytics-Cluster, Analytics-Kanban: Add better timestamp field to refined webrequest data - https://phabricator.wikimedia.org/T94584#1202267 (JAllemandou) Open>Resolved
[07:06:29] <wikibugs>	 Analytics-Cluster, Analytics-Kanban: Add map of tags to refined webrequest table - https://phabricator.wikimedia.org/T95178#1202268 (JAllemandou) Open>Resolved
[12:01:47] <wikibugs>	 Analytics-Tech-community-metrics, Phabricator, Wikimedia-Hackathon-2015, ECT-April-2015: Metrics for Maniphest - https://phabricator.wikimedia.org/T28#1202633 (Aklapper) p:Low>High
[13:00:29] <wikibugs>	 Analytics-Cluster, Analytics-Kanban: Add better timestamp field to refined webrequest data - https://phabricator.wikimedia.org/T94584#1202675 (Ottomata) So great, thanks!
[13:15:27] <grrrit-wm>	 (CR) Ottomata: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[13:15:32] <grrrit-wm>	 (CR) Ottomata: [WIP] Add Apps session metrics job (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[13:31:25] <wikibugs>	 Analytics-Tech-community-metrics: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1202709 (Aklapper) p:Normal>High
[14:08:03] <grrrit-wm>	 (PS7) Nuria: [WIP] Add Apps session metrics job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[14:13:42] <grrrit-wm>	 (CR) Nuria: [WIP] Add Apps session metrics job (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[14:18:46] <ottomata>	 nuria: btw, that way I was outputting to hdfs is probably not a very good way
[14:18:51] <ottomata>	 there is likely somehting better
[14:18:57] <ottomata>	 that was just the first thing I found :)
[14:19:31] <nuria>	 ottomata: no worries is that i do not like to upload untested code and the way the uncompress is failing doesn't let the job get to thta stage
[14:19:41] <ottomata>	 aye ok
[14:19:42] <ottomata>	 cool
[14:19:45] <ottomata>	 ja let's figure that out
[14:35:19] <wikibugs>	 Analytics-Kanban, VisualEditor: Schema:Edit seems to incorrectly set users as anonymous {lion} - https://phabricator.wikimedia.org/T92596#1202865 (Milimetric) When looking for stuff like this, it's good to keep in mind that visual editor and wikitext instrumentation are very different, and to look at them...
[14:38:18] <milimetric>	 mforns / joal: let me know if you want to talk EL
[14:38:27] <joal>	 I definitely want milimetric
[14:38:32] <mforns>	 milimetric, sure
[14:38:35] <joal>	 Just need to pick a time :)
[14:38:55] <mforns>	 joal, milimetric we can combine a time?
[14:39:09] <joal>	 oh yeah :)
[14:39:19] <milimetric>	 ok, I would love to eat something quickly and then?
[14:39:38] <milimetric>	 how about in 20 minutes?
[14:39:40] <mforns>	 8:30h PT?
[14:39:50] <halfak>	 milimetric, I can't find a phab bug for the Schema:Edit session ID issue.  Do you know if one exists?
[14:40:01] <mforns>	 milimetric, in 20 mins is ok for me too.
[14:40:17] <milimetric>	 halfak: there's no bug for that, I sent them an email to ask them their thoughts before I filed one
[14:40:21] <milimetric>	 they haven't responded yet
[14:40:24] <halfak>	 Gotcha.
[14:40:28] * halfak prepares to file
[14:40:31] <milimetric>	 halfak: as with all things, there are two session id issues
[14:41:07] <milimetric>	 1. visual editor: the session id used to be duplicated because of safari js problems, no longer is the case, but either way the bug has a simple work around (use clientIp to de-duplicate)
[14:41:48] <milimetric>	 2. wikitext: multiple sessions appear to be happening within the same sessionId (not a simple duplication of the sessionId, but actual distinct editing sessions with the same user but different articles)
[14:41:51] <milimetric>	 halfak: ^
[14:42:00] <halfak>	 Sounds like one issue
[14:42:04] <halfak>	 Two effects
[14:44:59] <ottomata>	 https://issues.apache.org/jira/browse/SPARK-4105
[14:47:47] <halfak>	 milimetric, is your sense that we don't need to worry about de-duplicating for VE anymore?
[14:50:31] <milimetric>	 halfak: yes, no need to de-duplicate
[14:50:51] <milimetric>	 but halfak it's two issues because there are two totally separate instrumentations, with two different people who worked on them
[14:51:08] <milimetric>	 however, as you say, the first issue is not an issue anymore really
[14:52:09] <halfak>	 milimetric, "two totally separate instrumentations, with two different people" that sounds like an issue in and of itself.
[14:52:18] <halfak>	 But maybe inevitable.
[14:52:39] <milimetric>	 :) this needed to be split up into a few schemas, that would've been better
[14:53:43] <halfak>	 I need an "I told you so jar" so that we can fill it with money and then take it to the bar and fill our glasses with our tears.
[14:54:15] <nuria>	 halfak: session duplication due to js problems is solved in browsers that support crypto api (not all browsers), now that doesn't solve issues with the instrumentation code itself like milimetric said
[14:54:34] <halfak>	 nuria, that was confusing.
[14:54:45] <halfak>	 I need to know if the issue has been resolved and I'm less concerned about the mechanism
[14:54:49] <halfak>	 Is it resolved for VE?
[14:55:09] <nuria>	 halfak: it's not a VE issue, it's a js one, documentation here:
[14:55:20] <halfak>	 Well, it is an issue with VE's documentation
[14:55:26] <halfak>	 *logging
[14:55:42] <halfak>	 If there's an issue in JS, it needs to be worked around.
[14:56:47] <nuria>	 halfak: it alredy is, halfak but only for browsers that support crypto api
[14:56:49] <nuria>	 https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki/mediawiki.user.js#L54
[14:57:27] <nuria>	 halfak: Again, it's not a VE Issue, it's an issue with JS, older browsers have no gurantees when it comes to give you a unique  id
[14:57:57] <nuria>	 halfak: to the extent it can be work arround it already is.
[14:58:44] <halfak>	 nuria, to what extent is it worked around.  How do I know what to expect in the logs?
[14:59:20] <halfak>	 I need to know under what circumstances the documentation here is wrong: https://meta.wikimedia.org/wiki/Schema:Edit
[14:59:34] <nuria>	 halfak: please take a look at docs, again, it is no issue for browsers that support crypto API. Explained here: https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki/mediawiki.user.js#L54
[15:00:01] <halfak>	 nuria, editSessionId is defined as "A string of 32 alphanumeric characters, unique to the current page view session; used for grouping events."
[15:00:10] <milimetric>	 mforns: batcave for EL?
[15:00:16] <mforns>	 milimetric, yessir
[15:00:22] <halfak>	 It seems that is wrong nuria
[15:00:28] <halfak>	 That is an issue
[15:00:58] <halfak>	 Also, we can't expect schema users to look for blocks of documentation in JS code.
[15:01:51] <nuria>	 halfak: i think we are talking pass each other, any js developer knows about issues with math.random in js
[15:01:59] <nuria>	 halfak: they are older than the times
[15:02:41] <halfak>	 nuria, OK, but this is logging.  The logging does not reference math.random.
[15:02:58] <halfak>	 Also, I'm a JS developer and I haven't don't anything with random in JS, so I don't know about the issue.
[15:03:32] <halfak>	 Further, the JS dev who implemented the hash for that schema apparently wasn't familiar with the issue.
[15:03:52] <halfak>	 Really, I don't care.  The logging needs to be right or documented to explain when it is not.
[15:04:02] <halfak>	 Implementation of the logging is a separate concern.
[15:04:16] <nuria>	 halfak: the logging (of any schema) uses js code, already present on mediawiki, which i sdocumented, and alex alredy knows about that, there is no issue there that i can see.
[15:04:38] <ottomata>	 ah weird, yeah, nuria, joal, now I have more exectuors!
[15:04:43] <nuria>	 ottomata: brb
[15:04:43] <halfak>	 nuria, I don't know how to use the logging.
[15:04:54] <halfak>	 I don't know how the logging was implemented and I shouldn't have to.,
[15:04:58] <halfak>	 This is an issue
[15:06:19] <wikibugs>	 Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1203025 (Lcanasdiaz) Guys, the panels with reviewed CSS have been updated a few days ago. Could you please check if this fits your needs?  Best.
[15:08:40] <nuria>	 halfak: i think milimetric can explain in detail, the loggin of unique ids is the same for every single EL schema
[15:09:02] <halfak>	 nuria, yes.  Dan and I are working together on logging issues.  This is why I am here talking to him about them.
[15:09:13] <nuria>	 halfak: wikigrok, hoovercards.. all of them use the same mw method
[15:09:48] <nuria>	 ottomata: did it fail for you?
[15:09:53] <halfak>	 nuria, cool.  So they all have this bug or do their schemas explain the problem?
[15:10:17] <halfak>	 I suppose it is still a bug even if it is documented -- just a known issue at that point.
[15:11:35] <nuria>	 halfak: right, it is a know issue  by tdevs, yes
[15:11:47] <nuria>	 ottomata: did job fail for you?
[15:12:07] <halfak>	 nuria, too bad the users of EL aren't always devs ;)
[15:14:32] <nuria>	 ottomata: should I give it a try removing the kyro serialization?
[15:15:23] <ottomata>	 nuria:  yes, but on the  QTree can not accept negative values again
[15:15:42] <nuria>	 ottomata: did you run it for 1 day and it worked?
[15:15:52] <ottomata>	 one day, it failed with the qtree thing
[15:16:04] <nuria>	 ottomata: then it pas the shuffle page.....
[15:16:07] <nuria>	 *passed
[15:17:12] <ottomata>	 nuria, i'm going to try again with Kryo and with filtering out negatives to qtree
[15:17:19] <nuria>	 ottomata: k
[15:19:10] <nuria>	 ottomata: there must be a bug with processing of data for negative values to make it there, but to troubleshoot that I can do better on spark shell with a small dataset
[15:20:03] <nuria>	 ottomata: problem i see then is that the uncompress errors then are sporadic right?
[15:35:07] <joal>	 join #wikimedia-operations
[15:35:10] <joal>	 oops
[15:37:08] <nuria>	 ottomata: no luck, in spark shell -with data for 1 day- i get ooms:
[15:37:29] <nuria>	 sorry, with data for 1 hour
[15:37:43] <nuria>	 ottomata: let me see, maybe code is at fault here
[15:38:20] <ottomata>	 ?
[15:42:07] <joal>	 ottomata: Need your advise here
[15:42:18] <wikibugs>	 Analytics-Cluster, Ops-Access-Requests, operations, Patch-For-Review: Requesting access to analytics-users (stat1002) for Jkatz - https://phabricator.wikimedia.org/T94939#1203191 (Andrew)
[15:42:22] <ottomata>	 joal, ja?
[15:42:30] <joal>	 I'd like to receive alarms fot eventlogging
[15:42:34] <ottomata>	 k
[15:42:40] <joal>	 ottomata: Seems to be some puppet change
[15:42:47] <joal>	 Shall I go for a pull request ?
[15:43:00] <ottomata>	 yeah, hmmmm, i can do it real quick
[15:43:00] <joal>	 ottomata: Or Is there a defined process
[15:43:31] <joal>	 I'll ask access to eventlog1001 through ops-requests-access channel
[15:44:00] <ottomata>	 yeha that is good
[15:44:14] <joal>	 And I let you set up the alarm thing ?
[15:44:56] <ottomata>	 yup, on that now, i just need to add a nagios contact and put you in the analytics contactgroup
[15:45:10] <joal>	 OkThx a lot :)
[15:52:03] <ottomata>	 joal, i think --executor-cores does work on yarn?
[15:52:03] <ottomata>	 https://yarn.wikimedia.org/proxy/application_1424966181866_84081/executors/
[15:52:07] <ottomata>	 Active Tasks
[15:52:11] <ottomata>	 is 2 for all executors
[15:52:38] <ottomata>	 nuria: OOMs?
[15:52:46] <joal>	 ottomata: Right
[15:53:34] <joal>	 ottomata: What's cool is that it takes the number of taskes associated with the number of cores per executors !
[15:53:39] <joal>	 That's goos :)
[15:55:49] <nuria>	 ottomata: yes, when running the combineByKey code,
[15:55:55] <nuria>	 ottomata: retrying that though
[15:57:51] <ottomata>	 i am also still running a job, so far so good
[16:00:00] <nuria>	 ottomata: k, still OOMs in the spark shell
[16:01:00] <ottomata>	 ah in shell
[16:01:04] <ottomata>	 in yarn mode?
[16:06:28] <ottomata>	 nuria
[16:06:30] <ottomata>	 one hour ran
[16:06:32] <ottomata>	 and finished
[16:06:33] <ottomata>	 soryr
[16:06:34] <ottomata>	 one day*
[16:06:35] <ottomata>	 Map(count -> 687147, quantiles -> List((11000.0,11001.0), (21000.0,21001.0), (40000.0,40001.0), (73000.0,73001.0), (127000.0,127001.0), (217000.0,217001.0), (382000.0,382001.0), (700000.0,700001.0), (1345000.0,1345001.0), (3561472.0,3562496.0)), geometric mean -> 0.0, maxima -> 26148000, arithmetic mean -> -7.415308149624453E8, minima -> -63555840010000)
[16:07:34] <nuria>	 ottomata: me no compredou, did you change anything on spark-submit?
[16:07:58] <ottomata>	 no not on the cli, i did change the code slightly, to filter out negative values for qtree
[16:08:04] <ottomata>	 and to add some cli opts, but that's it
[16:08:05] <ottomata>	 will paste
[16:08:27] <ottomata>	 https://gist.github.com/ottomata/39958e85f9d79bbb44db
[16:08:34] <wikibugs>	 Analytics-EventLogging, Hovercards: Clicks on cite refs generates JavaScript errors in eventLogging when popups is active - https://phabricator.wikimedia.org/T88784#1203304 (Fomafix) Open>Resolved a:Fomafix The described problem does not exist anymore. It seams to be fixed.
[16:08:46] <ottomata>	 the compiled jar is at stat1002:/tmp//tmp/rj-otto.jar
[16:09:16] <ottomata>	 i ran as
[16:09:16] <ottomata>	 spark-submit --class=AppSessionMetrics --master yarn --num-executors=12 --executor-cores=2 --executor-memory=4g /tmp/rj-otto.jar 'webrequest_source="mobile" and year=2015 and month=03 and day=20'
[16:09:47] <mforns>	 nuria, I'm looking at your patch for wikimetrics, what was the problem? What is it fixing?
[16:10:04] <ottomata>	 main change is this nuria
[16:10:05] <ottomata>	 https://gist.github.com/ottomata/39958e85f9d79bbb44db#file-appsessionmetrics-scala-L75
[16:10:11] <ottomata>	 .filter(_ >= 0)
[16:10:20] <nuria>	 mforns: go to prod and check out members of  a cohort
[16:10:22] <ottomata>	 but, i don't see why that would avoid the weird compression error
[16:10:30] <mforns>	 nuria, ok
[16:11:19] <nuria>	 ottomata: no, me neither, i got the compression error BEFORE getting to qtree
[16:11:32] <nuria>	 ottomata: is your user priviledged some how?
[16:12:53] <ottomata>	 nope
[16:13:08] <nuria>	 ottomata: man....
[16:23:03] <nuria>	 ottomata: i still get ooms in spark-shell if i do .take(10) after combineByKey with data for 1 hour
[16:24:17] <ottomata>	 are you running spark-shell in yarn or local mode?
[16:24:26] <ottomata>	 nuria: ?
[16:24:48] <nuria>	 the default, which i think is local mode
[16:24:51] <ottomata>	 aye
[16:24:54] <ottomata>	 that makes sense then
[16:25:42] <ottomata>	 1.9G	hour=0/
[16:25:53] <ottomata>	 i dunno what the default jvm size is
[16:25:56] <ottomata>	 probably a gig or something
[16:26:03] <ottomata>	 for spark shell
[16:26:23] <ottomata>	 nuria...you could test again using just parquetFile
[16:26:32] <ottomata>	 and specifying a single file, rather than a whole hour
[16:27:34] <ottomata>	 e..g
[16:27:35] <ottomata>	  hdfs dfs -ls -h /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=3/day=20/hour=0/000000_0
[16:27:35] <ottomata>	 -rwxr-xr-x   3 hdfs analytics-privatedata-users     30.2 M 2015-03-20 02:07 /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=3/day=20/hour=0/000000_0
[16:27:39] <ottomata>	 that is only 30 M
[16:27:48] <nuria>	 ottomata: ok, let me see , need to try enough data to get to the error of why qtree is dealing with negative numbers
[16:28:39] <nuria>	 ottomata: will do that if cannot work around it any other way
[16:28:56] <nuria>	 ottomata: so .. we are not worrying about the "uncompress" exceptions i was getting then?
[16:29:14] <nuria>	 ottomata: did you modified the spark-submit command?
[16:29:26] <ottomata>	 nuria: the spark command I submitted is in that gist
[16:29:27] <ottomata>	 and no
[16:29:33] <Ironholds>	 negative numbers?
[16:29:34] <ottomata>	 aside from passing it the patition predicate as an input string
[16:29:42] <Ironholds>	 wait, is this the session stuff?
[16:29:50] <Ironholds>	 I know why it's got negative numbers
[16:29:51] <ottomata>	 https://gist.github.com/ottomata/39958e85f9d79bbb44db
[16:30:30] <Ironholds>	 if the session reconstruction approach followed my methodology, it uses -1 in session length to indicate an incalculable (i.e., one-event) session
[16:30:41] <Ironholds>	 ditto time-on-page or actual intertimes
[16:30:55] <Ironholds>	 ...you probably want to filter out values of -1 prior to running aggregators
[16:31:00] <nuria>	 ottomata: ok, i see,so you submitted it to "yarn" mode rather than client mode
[16:31:26] <grrrit-wm>	 (CR) Mforns: [V: -1] "The test" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/203505 (https://phabricator.wikimedia.org/T93023) (owner: Nuria)
[16:32:47] <grrrit-wm>	 (CR) Nuria: "Argh....very true. Need to remove some code from both those two so encoding is just handled by these changes at flask layer. Thanks for ca" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/203505 (https://phabricator.wikimedia.org/T93023) (owner: Nuria)
[16:34:27] <ottomata>	 nuria: that is what you did in your command too, no?
[16:34:32] <ottomata>	 the one you pasted me in the gchat
[16:34:46] <ottomata>	 but ja, if you are ever running on larger datastes, you will probably always have to run in yarn
[16:34:52] <ottomata>	 which i think you can do with spark-shell too
[16:35:07] <ottomata>	 not exactly sure how that works or when your resources get allocated in spark shell
[16:35:09] <ottomata>	 but i thnk it works
[16:36:15] <nuria>	 ottomata: yes, but that is why i do not get why you do not see the compression errors if we are running teh same command... argh, will retry
[16:36:35] <nuria>	 Ironholds: k
[16:36:56] <ottomata>	 nuria: not sure either.
[16:37:42] <grrrit-wm>	 (CR) Mforns: "It is really cool that it will be handled at flask level!" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/203505 (https://phabricator.wikimedia.org/T93023) (owner: Nuria)
[16:46:38] <nuria>	 argh, everything looks bad today, looks like logster counts also did not make it to graphite labs
[16:52:33] <mforns>	 milimetric, can you pass me a sample query of those you did to discover the lack of data in the eventlogging db?
[17:05:00] <wikibugs>	 Analytics, Ops-Access-Requests, operations: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1203559 (Shouston_WMF) @Ottomata, ultimately I'm simply assisting Winifred Olliff and Katy Love with obtaining this data for them to perform analysis. Perhaps if Joady hasn't gotten back...
[17:12:59] <wikibugs>	 Analytics-EventLogging, Analytics-Kanban: Investigate EventLogging Monitoring with Ops DBA {oryx} - https://phabricator.wikimedia.org/T86200#1203583 (kevinator)
[18:12:08] <wikibugs>	 Analytics-EventLogging: agent_type field does not work for anything except last few hours - https://phabricator.wikimedia.org/T95806#1203734 (kevinator) Only rows marked with record_version = "0.0.3" contain data in the new columns.  https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_k...
[18:25:00] <wikibugs>	 Analytics-Cluster, operations, ops-eqiad: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1203785 (Cmjohnson) Replacement board sent was refurbished and had an idrac error. Failed to load idrac. Most likely because it was never reset.  A new board will be sent tomorrow.
[18:49:49] <nuria>	 mforns, Ironholds: yt? question about session code
[18:49:49] <Ironholds>	 nuria, yep
[18:49:49] <nuria>	 mforns, Ironholds : users that just have 1 visit (1 hit) in the period we are interested on, are they
[18:49:49] <nuria>	 Ironholds: counted towards sessions?
[18:49:49] <nuria>	 sorry
[18:49:50] <nuria>	 Ironholds: counted towards "sessions per user"
[18:49:50] <nuria>	 that is 1hit (1 page lookup) should count +1 towards that uuid session?
[18:49:50] <Ironholds>	 uhm
[18:49:51] <Ironholds>	 that is, they had 1 event with a 30-60 minute window on each side?
[18:49:51] <Ironholds>	 (depending on which issue you're dealing with)?
[18:49:51] <Ironholds>	 *which definition
[18:49:51] <nuria>	 ya , or days right?
[18:49:51] <nuria>	 i go look up "page1" monday and "page2" tuesday
[18:49:51] <Ironholds>	 they would be included in the sessions for the purpose of calculating say bounce rate
[18:49:51] <nuria>	 that's it
[18:49:51] <Ironholds>	 but they would not be included for session length
[18:49:51] <Ironholds>	 if this is "why are there negative values?" - see above re, -1 as a holding value potentially being the source
[18:49:51] <Ironholds>	 if not then: ask whatever questions you've got!
[18:49:52] <nuria>	 Ironholds: what i was thinking is that there is no need tomark those as "-1" rather if
[18:49:52] <Ironholds>	 if we only want session length then we can drop them
[18:49:52] <nuria>	 in the timeperiod you are interested there is only one timestamp for that uuid that should get filter 1st, that is fileter #1
[18:49:52] <Ironholds>	 if we want sessions per user or bounce rate, though, we need those values
[18:49:52] <nuria>	 Ironholds: no bounce
[18:49:52] <Ironholds>	 sessions per user?
[18:49:52] <nuria>	 lemme see
[18:49:53] <nuria>	 sessions per user yet
[18:49:53] <nuria>	 *yes
[18:49:53] <Ironholds>	 then you'll need one-event sessions
[18:49:53] <nuria>	 k
[18:49:54] <Ironholds>	 I'd suggest filtering them out pre-session length
[18:49:54] <Ironholds>	 or depending on how you're structuring it:
[18:49:54] <Ironholds>	 you could calculate the number of sessions for [user] while sessionising their events
[18:49:54] <Ironholds>	 and then return that with [list of sessions, without one-event sessions]
[18:49:54] <Ironholds>	 and that way you wouldn't have to do any kind of filtering, just, extract the session_number and mean() all of those, and pass the actual sessions into the session length code
[18:49:54] * Ironholds shrugs. Y'all know the infrastructure and conventions here better than me :)
[20:15:11] <nuria>	 ottomata: ok, finally figured out why qtree was getting negative values, there were two bugs on the code
[20:15:11] <nuria>	 ottomata: one trivial to fix with no impact in perf
[20:15:11] <nuria>	 ottomata: the other one also "easy" to fix but with perf impact as we need to keep track of a sorted list where there was no sorted list before
[20:15:16] <ottomata>	 hm
[20:15:16] <ottomata>	 weird, what's up?
[20:15:16] <ottomata>	 what's the bug?
[20:15:16] <nuria>	 ottomata: it os the combineByKey behaviour (not a bug)
[20:15:17] <ottomata>	 hm
[20:15:17] <ottomata>	 what's it do?
[20:15:17] <nuria>	 we were only sorting items if they are found across partititions
[20:15:17] <nuria>	 before:
[20:15:17] <nuria>	 https://www.irccloud.com/pastebin/XSAyJcwH
[20:15:17] <nuria>	 I think correct is:
[20:15:18] <nuria>	 https://www.irccloud.com/pastebin/bwB7zmgc
[20:15:19] <ottomata>	 hm, i don't understand fully, but that is ok!
[20:15:19] <ottomata>	 :)
[20:15:19] <nuria>	 ottomata: hopefully i am not totally off here cc mforns
[20:15:19] <mforns>	 hey
[20:15:20] <mforns>	 nuria, I thought that the timestamps were already ordered, because the key of the combineByKey input is a pair (uuid, timestamp)
[20:15:20] <mforns>	 so, I guess spark should sort by the two
[20:15:21] <mforns>	 I put it there, because it made sense to me to use the map reduce capabilities to sort that instead of using .sort()
[20:15:21] <mforns>	 *.sorted
[20:15:21] <nuria>	 mforns: the combineByKey is using (uuid, timestamp) and agreggating on uuid correct?
[20:15:21] <ottomata>	 hm, interesting, scala's list does now how to sort tuples!
[20:15:21] <nuria>	 mforns: if all timestamps are found in the same partition
[20:15:21] <mforns>	 nuria, yes, the implicit part of combineByKey uses the pair (uuid, timestamp) to collect the timestamps by key, that's why I think that the timestamps are ordered
[20:15:21] <nuria>	 they will be like (uuid, t1, t3, t4, t6)
[20:15:21] <mforns>	 nuria, exactly
[20:15:21] <nuria>	 as there is no guarantee that in your dataset those are ordered
[20:15:21] <mforns>	 nuria, you're right
[20:15:22] <nuria>	 so -if all timestamps are in the same partition- at the end of combineByKey you have a (uuid, <unordered dataset>)
[20:15:22] <nuria>	 ok, and since in the last patch we "removed" the subsequent order this function has to take care of it all
[20:15:22] <mforns>	 nuria, so therefore I'd say that only the second .sorted is necessary, which reshufles the sets of each partition into a sorted list again
[20:15:22] <mforns>	 nuria, wait
[20:15:22] <nuria>	 but the second "sorted" it is not called if values are
[20:15:22] <mforns>	 batcave?
[20:15:22] <nuria>	 in the same partition
[20:15:22] <nuria>	 k
[20:16:22] <wikibugs>	 Analytics-Kanban, VisualEditor: Schema:Edit seems to incorrectly set users as anonymous {lion} - https://phabricator.wikimedia.org/T92596#1204378 (Halfak) Indeed. Thanks @milimetric.  Here's a query that summarizes the problem.    ``` > SELECT     ->   event_editor,     ->   rev_user = 0,      ->   sum(`e...
[21:05:30] <wikibugs>	 Analytics-Cluster: Hue shows error from varnish when issuing Hive query - https://phabricator.wikimedia.org/T95953#1204658 (Ottomata) NEW a:Ottomata
[21:09:33] <mforns>	 ottomata, yt?
[21:11:00] <ottomata>	 yes hiya
[21:11:02] <ottomata>	 mforns:
[21:11:25] <mforns>	 hey, I was looking at something to log mysql stats to statsd
[21:11:31] <mforns>	 and found this: https://github.com/spilgames/mysql-statsd
[21:11:53] <mforns>	 do you have another idea for EL mysql, that would send stats to graphite?
[21:12:26] <mforns>	 ottomata, this is a python package, I understand there's problems when puppetizing that, right?
[21:13:27] <YuviPanda>	 mforns: diamond, which we already use, ships with https://github.com/BrightcoveOS/Diamond/wiki/collectors-MySQLCollector I think
[21:13:31] <YuviPanda>	 so can be turned on if needed
[21:13:43] <YuviPanda>	 (and https://github.com/BrightcoveOS/Diamond/wiki/collectors-MySQLPerfCollector too I think)
[21:13:44] <mforns>	 oh! awesome!
[21:13:49] <YuviPanda>	 not sure if those are the stats you are looking for
[21:13:59] <mforns>	 I look for #inserts
[21:14:50] <ottomata>	 hm, uhhh
[21:15:00] <ottomata>	 i am not sure, you want # of inserts on a speciifc table?  or in a db?
[21:15:34] <ottomata>	 mforns: these tables are insert only...do they have an auto-increment id on them?
[21:15:40] <ottomata>	 you could calculate that too...per table.
[21:15:56] <mforns>	 ottomata, you're right
[21:15:58] <ottomata>	 (current_max_id = last_max_id) / duration
[21:16:20] <ottomata>	 could do count(*) where timestamp > last_time
[21:16:25] <ottomata>	 but ids are probaly faster
[21:16:27] <nuria>	 mforns, ottomata : we just removed the autoinsert id, so no
[21:16:35] <mforns>	 ops
[21:16:35] <ottomata>	 whaaa why you remove autoinsert id?
[21:16:36] <ottomata>	 :)
[21:16:42] <nuria>	 mforns, ottomata : at springle request as it was a waste of space
[21:16:58] <nuria>	 mforns, ottomata: old tables will still have it, new tables will not
[21:17:06] <mforns>	 yea remember
[21:17:22] <ottomata>	 psshhhh
[21:17:39] <nuria>	 mforns: but careful, cause you do not have access to db mforns , only to EL box
[21:17:47] <mforns>	 so ottomata you were suggesting that because diamond will not count that?
[21:17:49] <ottomata>	 when did we do the philly hackathon? milimetric?  was that the first week of march?
[21:17:58] <ottomata>	 mforns: i dunno what diamond can do
[21:17:59] <ottomata>	 or does :)
[21:18:12] <mforns>	 nuria, milimetric pointed me to the DB creds file
[21:18:38] <mforns>	 ottomata, ok will look at diamond
[21:18:58] <milimetric>	 ottomata: March 2nd to 6th
[21:19:27] <nuria>	 that is from EL box as the EL user, so it is not wide open access that you could setup mysql to log
[21:19:28] <ottomata>	 danke
[21:19:35] <nuria>	 mforns: if it makes sense
[21:20:05] <mforns>	 nuria, I see
[21:20:32] <nuria>	 mforns: you have insert access to the mysql EL db on the master but nothing beyond that
[21:20:53] <milimetric>	 mforns: omg, so sorry I missed your message earlier, here's the query:
[21:20:57] <milimetric>	 set @from1 = '20150409171248';
[21:20:57] <milimetric>	 set @to1 = '20150409184948';
[21:20:57] <milimetric>	  select 'DeprecatedUsage_7906187', count(*) as events, left(timestamp, 11) ts from DeprecatedUsage_7906187 where timestamp between @from1 and @to1 group by ts;
[21:21:13] <milimetric>	 (and play with the time span)
[21:21:26] <mforns>	 milimetric, don't worry, I got a very similar query on edit tables
[21:21:30] <milimetric>	 k
[21:22:13] <mforns>	 milimetric, as you said you were querying the information_schema db, I thought you were creating ninja dynamic queries across all log tables.
[21:24:36] <milimetric>	 mforns: oh yeah, but not ninja at all
[21:25:02] <milimetric>	 it looked something like "select *that query I just pasted* from information_schema.tables where table_schema = 'log'"
[21:27:06] <mforns>	 milimetric, you put a query inside the select statement?
[21:27:38] <milimetric>	 not a query but like a string that I then use as a query
[21:27:56] <milimetric>	 like select concat('select count(*) from ', table_name) from ...
[21:31:06] <mforns>	 milimetric, I see, that *is* ninja
[23:00:55] <grrrit-wm>	 (PS1) Yurik: Fixed broken weblogs2.py [analytics/zero-sms] - https://gerrit.wikimedia.org/r/203986
[23:01:19] <grrrit-wm>	 (CR) Yurik: [C: 2 V: 2] Fixed broken weblogs2.py [analytics/zero-sms] - https://gerrit.wikimedia.org/r/203986 (owner: Yurik)
[23:10:00] <wikibugs>	 Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1205297 (ellery) NEW
[23:11:15] <wikibugs>	 Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1205305 (ellery) Is this a a pain?
[23:14:59] <wikibugs>	 Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1205318 (Yurik) I am currently running a query that generates pageviews and total download size per day per country per proxy per host. The data is already half-way there. I think it would...