[00:03:10] halfak: the oozie jobs filling in the webrequest table have been falling further and further behind today. Right now they are 4 hours behind where they should be. I'm guessing it's because the cluster is slammed. Do you have a sense of when your streaming jobs will finish?
[00:04:36] I'm antsy because there is a bug in the FR infrastructure that is showing banners to people who should not see them. A few patches were deployed this morning to fix this and I need those recent hive partitions to diagnose the issue
[00:26:17] * Ironholds blinks
[00:26:31] ewulczyn, if you read your email you will note that Halfak is actively ill, and is not here.
[00:27:11] Halfak is currently running two jobs that I can see, and neither is crazy-crazy-big
[00:27:28] have you checked with the analytics engineers and ops to confirm what the problem is?
[00:29:28] hmm, what metric are you using to see how many resources they are using?
[00:30:18] The streaming jobs are taking up half the available containers
[00:30:41] hadoop job -list
[00:30:43] and, sure
[00:31:03] but last time we had hadoop capacity problems there were ZERO available containers and jobs wouldn't run at all
[00:31:10] and I don't recall critical issues around new data coming in
[00:31:22] so I'd suggest checking with otto and qchris before declaring a diagnosis
[00:32:27] data will not be lost, but possibly delayed
[00:32:53] you can check to see how full the cluster is
[00:33:04] if you do the ssh tunnel thing
[00:33:09] ssh -N bast1001.wikimedia.org -L 8088:analytics1010.eqiad.wmnet:8088
[00:33:15] and go to
[00:33:26] http://localhost:8088/cluster/scheduler
[00:33:42] also if we're doing diagnoses now can anyone tell me where I put my keys?
[00:33:50] haha
[00:33:54] i see lots of people running jobs
[00:35:00] 2 halfak, 2 nuria, yurik , west1 (bob), ellry has 3, 4 qchris
[00:35:01] etc.
[00:35:04] oop, i have to
[00:35:04] byebye
[00:35:06] :)
[00:35:10] no, seriously, I'm meant to be delivering christmas presents and I can't find them
[00:38:42] ewulczyn, we can talk about this with otto tomorrow, but relying on the cluster to be available and up to date to diagnose prod issues might not be the best strategy
[00:38:54] ewulczyn: remember hadoop is not tier-1
[00:40:21] Sure, let's do that. I would also like to talk about scheduling. I think our scheduler does not reallocate resources between running jobs. It looks to me like once a job has resources, it never has to give them up.
[00:44:35] ewulczyn: i think a job that is not running in hadoop should only hold minimal scheduling resources, but otto is the one that knows
[00:48:25] ewulczyn: but in any case hadoop can go down and being tier-2 means that no mediawiki functionality should rely on it for real time decisions
[00:50:21] what exactly did fundraising do before hadoop?
[00:50:25] that. Do that thing.
[00:50:31] and if that thing is "do nothing and hope"...
[00:52:39] nuria: sure, I'm just trying to help FR debug something. I was not planning on doing it this way or relying on hadoop for real time decisions. Normally the logs are in hive with a 2-3 hour delay. Today there is a lag and I was trying to get a sense for whether I should expect the lag to increase.
[00:53:27] ewulczyn: could you use sampled logs in the interim (likely not but .... just throwing it out there...)
[00:53:54] those are by definition a day out of date, I thought
[00:54:25] it's ok, it can wait until tomorrow. thank you guys for your suggestions
[00:55:58] ewulczyn: Ok, let's talk tomorrow, ciao!
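(Editor's note: the scheduler page linked above is the YARN ResourceManager web UI. Once the ssh tunnel is open, the same ResourceManager also exposes a REST API that can be polled from a script instead of eyeballing the page. Below is a minimal sketch assuming the tunnel is already up on localhost:8088; the exact metric fields available can vary by Hadoop version, and nothing here is a WMF-specific tool.)

```python
# Sketch: poll the YARN ResourceManager over the ssh tunnel described above
# (ssh -N bast1001.wikimedia.org -L 8088:analytics1010.eqiad.wmnet:8088).
# /ws/v1/cluster/metrics is part of the standard ResourceManager REST API.
import json
from urllib.request import urlopen

RM = "http://localhost:8088"  # assumes the tunnel is already open

with urlopen(RM + "/ws/v1/cluster/metrics") as resp:
    metrics = json.load(resp)["clusterMetrics"]

allocated_mb = metrics["allocatedMB"]
total_mb = metrics["totalMB"]

print("apps running:", metrics["appsRunning"], "pending:", metrics["appsPending"])
print("containers allocated:", metrics["containersAllocated"])
print("memory in use: %d/%d MB (%.0f%%)"
      % (allocated_mb, total_mb, 100.0 * allocated_mb / max(total_mb, 1)))
```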
[02:18:07] so nobody has seen my keys?
[06:00:58] Ironholds: I’ve seen your keys! They’re in LDAP
[06:01:03] * YuviPanda runs away slowly
[06:01:04] Ironholds, I hid your keys in the last place you'll look.
[06:01:14] dangit, timing!
[06:01:25] hahaha
[12:51:34] quiddity, good!
[14:19:28] o/ Ironholds
[14:19:44] I see that my hadoop jobs have been cause for some worry.
[14:22:41] hey halfak!
[14:22:51] I'mma do that caching study over christmas, btw
[14:23:12] Do you have something instrumented for it?
[14:23:22] the NavigationTimings schema that's still running
[14:23:37] actually...wait. I have an idea
[14:23:39] mwahahahahahaa
[14:25:11] I wonder if I can just grab the eventlogging JS triggers from the request logs
[14:25:17] that way I don't need to worry about IP sanitisation
[14:26:29] So you're comparing requests for the navtiming JS with successfully recorded EL events?
[14:27:23] comparing successfully recorded NavigationTiming events for views with completed requestlog entries, is the plan
[14:27:43] throw the NavigationTimings schema into Hive, LEFT JOIN on to webrequest where agent, uri_path and uri_host match
[14:27:54] but the NT IP addresses are sanitised which limits the amount of information I can get out
[14:28:00] so I'm trying to think of ways to get around that. hmn.
[14:33:31] Ahh. I see.
[14:34:44] Is the sanitization a simple one-way hash?
[14:34:48] Ironholds, ^
[14:35:17] yep, and it refreshes every 90 days
[14:35:24] so if I'm allowed to get the key I can crack it trivially
[14:35:30] (thank god for vectorised sha1 generation!)
[14:35:37] I knew writing that library would be useful
[14:36:01] Oh. So it's a salted hash.
[14:36:15] oh, gotcha. Misread. Yeah, tis salted.
[14:36:55] Ironholds, Navtiming only runs for a sample of requests. How will you know that a request was selected?
[14:37:52] by throwing the navigation timing data into hadoop and joining on to the webrequests data from the same period
[14:38:13] if there's an NT entry with no corresponding webrequest entry, the page request never made it server-side
[14:38:45] I see. But the opposite case wouldn't be detectable.
[14:39:04] I guess you can reason based on the sampled proportion.
[14:39:27] Oh say, do you know what the conclusion of the Hadoop troubles was yesterday?
[14:39:30] Is it my jobs?
[14:39:52] I don't know!
[14:40:19] and, yeah, the opposite would be true, but I think that's okay
[14:41:10] the question is what proportion of user requests for content are handled by intermediaries. That's the delta between the requests triggered regardless of caching (i.e., JS) and the requests triggered only if there is no caching (webrequest entries for the same pageview)
[14:41:45] (no caching before varnish)
[14:41:53] But yeah... I'm with you.
[14:42:01] Also, timestamps might not match.
[14:42:55] You might not be able to match requests with perfect accuracy -- even if the salt is used.
[14:47:44] yeah, I'm not using timestamps
[14:47:59] that's why I want the underlying IPs :/
[14:48:17] otherwise we've just got project, page (and there's some awkwardness and munging there) and UA
[14:48:20] So, what if a user loads the same URL twice?
[14:48:41] I guess you can use counts -- but wait -- NavTiming sampling is per-request.
[14:54:48] oh bollocks
[14:54:53] * Ironholds headscratches
[14:55:07] hrm. Maybe we can use the EventLogging triggers in the webrequests?
[14:55:12] but I don't know what format those take
[14:55:53] Javascript tends to get aggregated, so we can't look for that.
[14:55:55] Hmm...
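(Editor's note: a rough sketch of the LEFT JOIN Ironholds describes above — NavigationTiming events joined onto webrequest on user agent, URI host and URI path, with unmatched NT rows being the pageviews that "never made it server-side". The table and column names are hypothetical stand-ins, not the real EventLogging or webrequest schemas, and the query is shown as a string handed to the hive CLI rather than a tested job.)

```python
# Sketch only: table/column names below are hypothetical placeholders for the
# NavigationTiming EventLogging data loaded into Hive and the webrequest table.
import subprocess

QUERY = """
SELECT
  COUNT(*)                                             AS navtiming_events,
  SUM(CASE WHEN wr.uri_path IS NULL THEN 1 ELSE 0 END) AS no_matching_webrequest
FROM navigation_timing nt            -- hypothetical: NT events loaded into Hive
LEFT JOIN (
    SELECT user_agent, uri_host, uri_path
    FROM webrequest
    WHERE year = 2014 AND month = 12 AND day = 22      -- illustrative partition
) wr
  ON  wr.user_agent = nt.user_agent
  AND wr.uri_host   = nt.uri_host
  AND wr.uri_path   = nt.uri_path;
"""

# Rows with no webrequest match are the candidate "handled by an intermediary"
# pageviews. As discussed above, repeat loads of the same URL by the same UA
# fan this join out, so a real query would need per-key counts or dedup.
subprocess.check_call(["hive", "-e", QUERY])
```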
[14:56:15] see, this is one of those days when I wish analytics engineering were not horrifyingly overburdened so we could actually look at this properly.
[14:56:27] actually, thinking about it, I spend every day wishing analytics engineering were not horrifyingly overburdened.
[14:56:33] still! This day particularly!
[14:59:30] halfak, ToAruShiroiNeko: I ran a test with 1000 recent changes and got this classification report: https://gist.github.com/he7d3r/7f2aebb00e18b4963d07
[14:59:36] any ideas what could be wrong?
[15:02:49] Helder, I'm not sure. It looks like the classifier is saying nothing gets reverted. Is that right?
[15:03:05] yeah, that seems correct
[15:03:41] I printed the predictions of a smaller set of revisions at some point and it was all "non-reverts"
[15:03:42] =/
[15:04:17] Not sure what's up at a glance. I'll have to take a look later.
[15:05:03] any comments about that FIXME? https://gist.github.com/he7d3r/7f2aebb00e18b4963d07#file-demonstrate_scorer_on_rc-py-L76-L77
[15:05:09] In the meantime, could you write out a dataset of features and scores?
[15:05:43] Helder, looks right to me.
[15:06:51] this comment confused me when writing that line: https://github.com/halfak/Revision-Scoring/blob/5c740e278c4019b08551ae00cd57cbaf98110f82/revscores/scorers/scorer.py#L28-L29
[15:08:02] Indeed. The model should already be trained by the time you pass it to the scorer.
[15:10:44] but the model can only be trained after I use "linear_svc.extract" on each revision, so training the model should not be a pre-requisite to creating the "linear_svc" which is needed for this extraction of features
[15:11:20] Oh. You don't need to use that method to extract the features. You can use APIExtractor directly.
[15:12:17] The extract() method on MLScorer should really be private.
[15:13:15] hmm...
[15:13:24] I was doing something like that in the previous version:
[15:13:24] https://gist.github.com/he7d3r/7f2aebb00e18b4963d07/20f95aba198f190652234e41abc025b9f0c4ee6b#file-demonstrate_scorer_on_rc-py-L78
[15:13:53] but since I found that method, I though I should use it..
[15:14:34] *thought
[15:15:13] Either way, if you want to refactor the model/MLScorer pattern, I'd be happy to discuss the options with you.
[15:17:20] one thing that looks weird now that I got it working is this line:
[15:17:21] https://gist.github.com/he7d3r/7f2aebb00e18b4963d07#file-demonstrate_scorer_on_rc-py-L83
[15:18:06] Yup. That function is designed to allow you to extract features for a set of rev_ids.
[15:18:13] And it returns an iterator.
[15:18:51] the "extract" accepts a list/iterator, but I wanted to pass just a revision number, so I create a list with that single element, and then after I get the result, I have to... well you see what I did there
[15:19:09] Yup. This is reasonable IMO
[15:19:16] on the other hand, "reverts.api.check_rev(session, rev)" doesn't accept an interator
[15:19:39] *iterator
[15:22:13] Indeed.
[15:22:45] I suspect that it will be common that we'd like to gather features for a set of revisions and that optimizations can be done to group requests together.
[15:23:14] We could write a function extract(rev_id, features) --> values and extractiter(rev_ids, features) --> iterable
[15:23:20] I think that would be a fine change.
[15:23:31] * halfak doesn't really like the name "extractiter"
[15:24:25] halfak: what about a single function which accepts both kinds of arguments?
[15:24:47] So, it would return something different depending on the arguments you gave it?
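(Editor's note: on the "everything predicted as non-revert" report above, a common culprit with revert data is class imbalance, since reverted edits are a small minority of recent changes, and a linear SVC trained on such data can collapse to the majority class. A quick sanity check is to look at the label counts and a per-class classification report, and to try class weighting. This is a generic sketch on synthetic data, not Helder's actual script from the gist.)

```python
# Generic sketch, not the gist's code: check whether the classifier is
# collapsing to the majority class because reverts are rare.
from collections import Counter

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)           # placeholder feature vectors, one row per revision
y = rng.rand(1000) < 0.03       # ~3% positives, mimicking how rare reverts are

print(Counter(y))               # if reverts are only a few percent, imbalance is likely the issue

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# class_weight="auto" (renamed "balanced" in newer scikit-learn) penalises
# mistakes on the rare class more heavily than on the common one.
model = LinearSVC(class_weight="auto")
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```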
[15:24:54] That doesn't sound simple or obvious to me.
[15:26:34] yes?
[15:26:38] whi not?
[15:26:40] *why
[15:27:31] I'm generally against having mixed return types for a function.
[15:27:37] Why not just have two functions?
[15:27:40] halfak: I was thinking about something similar to https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki/mediawiki.js#L1771-L1772
[15:28:36] that is what allows us to use mw.loader.using('foo') or mw.loader.using(['foo','bar'])
[15:28:44] with a single function mw.loader.using
[15:29:02] but I didn't think about the return values really
[15:29:16] so, maybe not actually the same thing
[15:29:27] Yeah. that's the real problem IMO.
[15:33:20] about the dataset you mentioned, Danilo mentioned this http://pastebin.com/QJAtVrfz, which he used with his rev_features20141213.csv
[15:33:39] halfak: maybe that file is enough?
[15:34:04] (although it is not based on identity reverts, but on edit summaries, I think)
[15:34:06] Maybe indeed. I haven't had a chance to look at it.
[15:36:29] halfak: BTW, I confirm previous impressions that there is something slow when extracting these features...
[15:46:49] Helder, I think it is the separate requests for revision, previous revision, previous user revision, etc.
[15:48:14] Helder, I think that we can get some substantial benefits from caching.
[15:48:26] Active users and active pages will have their revisions requested often.
[15:48:52] possibly
[15:48:56] Also, we might be able to use the DB to gather some of the data.
[16:03:17] morning
[16:55:53] http://tools.ietf.org/html/rfc3092 I love that this exists
[17:22:02] you know, the one downside of writing this tool is I can't give it a stupid name
[17:23:58] Wait... Isn't that one of the perks of making stuff -- that you get to name the things you make?
[17:24:34] yeah, but I'm building an IETF RfC scraper in R
[17:24:44] so I should put a reference to R in it, right?
[17:24:47] but it's already RfC!
[17:25:50] Why are you building an RFC scraper?
[17:28:02] I wanted an excuse to test Hadley's new rvest package (beautifulsoup but for R)
[17:28:22] plus I figure I could find out some interesting things. Distribution of authors, distribution of publication dates.
[17:29:24] links between RfCs!
[18:14:02] yo halfak
[18:14:08] Hey tnegrin
[18:14:18] are you working today?
[18:14:28] I did want to touch base but it's not urgent
[18:14:29] Yup. Feeling much better. :)
[18:14:46] excellent
[18:15:11] do you have a few minutes to hangout? I can also schedule something for later if you're busy
[18:15:29] I'm in health check right now.
[18:15:36] ok -- I'll send you an invite
[19:38:02] yo halfak, just checking in, where are we with the xmldump stuff? i don't even remember!
[19:38:40] ottomata, tests complete. I think the reason snappy was getting more mappers was because the files were split.
[19:38:52] Regardless we used substantially less CPU time, so I'm using snappy anyway.
[19:38:59] I'm running diff and persistence jobs right now.
[19:39:09] I have one job that has been stuck at 100% for 10 hours.
[19:39:18] snappy avro?
[19:39:25] snappy json.
[19:39:36] Avro didn't make a difference and I don't know how to write it out anyway.
[19:39:39] is that in our experiment?
[19:40:36] Yup. Looking at JSON-uncompressed and Avro-uncompressed.
[19:41:56] Woops. looks like that uber-long job finished.
[19:42:18] ls -al
[19:42:20] woops
[19:42:22] heh
[19:42:59] ottomata, do we have any command-line tools for working with snappy files?
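(Editor's note: to make the extract()/extractiter() split halfak proposes above concrete, here is one possible shape for the pair, with the single-revision form as a thin wrapper over the batch form, which removes the wrap-in-a-list / unwrap-the-iterator dance Helder describes. This is a hypothetical sketch of the interface being discussed, not the actual Revision-Scoring code; the batching stub stands in for whatever request-grouping the real APIExtractor would do.)

```python
# Hypothetical sketch of the two-function pattern discussed above; not the
# real revscores API.
def _extract_batch(rev_ids, features):
    """Stand-in for the grouped API requests a real extractor could make.

    Yields one list of feature values per rev_id, in order.
    """
    for rev_id in rev_ids:
        # A real implementation would batch API calls here; this fake one
        # just applies each "feature" (a callable) to the rev_id.
        yield [feature(rev_id) for feature in features]


def extractiter(rev_ids, features):
    """Batch form: takes an iterable of rev_ids, returns an iterator of value lists."""
    return _extract_batch(rev_ids, features)


def extract(rev_id, features):
    """Single form: takes one rev_id, returns one list of values."""
    return next(extractiter([rev_id], features))


# Toy usage with "features" that are just callables on the rev_id:
is_even = lambda rev_id: rev_id % 2 == 0
n_digits = lambda rev_id: len(str(rev_id))

print(extract(123456, [is_even, n_digits]))               # one revision
print(list(extractiter([1, 22, 333], [is_even, n_digits])))  # several revisions
```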
[19:43:11] It would be really nice to be able to look at these files in 'less'
[19:43:29] Yikes. That big job ended up writing out 11TB of data.
[19:44:16] halfak: JSON-uncompressed is not snappy
[19:44:26] halfak: you can do
[19:44:27] Oh yeah. No doubt.
[19:44:30] hdfs dfs -text
[19:44:32] instead of -cat
[19:45:21] i think the avro uncompressed data is wrong
[19:45:26] i want to recreate that
[19:45:39] if you have json snappy data, can you add that to the results?
[19:45:44] OK. Either way, why is it that you think avro will be faster?
[19:45:45] Sure.
[19:47:47] * halfak waits for copy to complete
[19:53:50] ottomata, find json-snappy in the same dir as the others.
[19:54:03] I'll run another test on it quick.
[19:56:32] halfak: not sure for streaming, since you are being presented json in your job anyway, so you are parsing the json
[19:56:58] for anything else (hive, whatever), avro should be faster, as it does not have to parse json, but reads a binary format
[19:59:21] Makes sense. Also means that I won't usually operate in avro.
[20:00:49] doesn't matter, if you use streaming, the input format will give it to you in json
[20:01:02] halfak: you will if you use a hive table mapped on top of it :)
[20:01:10] Indeed.
[20:01:54] It's just that I produce a lot of intermediate datasets on an ad-hoc basis. Those probably won't be avro.
[20:05:58] as you like :) it actually is hard to write out to avro using hadoop streaming
[20:06:01] the support isn't very good
[20:06:23] halfak: we aren't going to dictate any formats for anything, so don't worry.
[20:06:31] if we productionize stuff, we will choose formats
[20:06:41] but for ad hoc, you can do whatever you want :)
[20:07:48] Indeed. I figured as much. Just telling you my plans.
[20:24:09] hey halfak http://cran.rstudio.com/web/packages/reconstructr/
[21:07:51] forgot to ask this in health check..does research have 10% time as the analytics eng team does?
[21:11:59] yes!
[21:12:06] we are allowed to spend 10% of our spare time not-working.
[21:12:29] we don't have 10% time and historically arguing for a certain portion of time to be spent on things not laid down in the plan, or directly stakeholder-answerable, has ended badly
[21:12:39] because we're not consistently trusted to spend it well.
[21:12:50] the phrase "blue-sky research" makes some managers break out in hives.
[21:13:26] What Ironholds said.
[21:13:58] I love it when people say that
[21:14:01] you know the way to my heart, halfak
[21:14:04] :P
[21:14:46] oh, halfak, https://gist.github.com/Ironholds/1b9bace3f6c4f9189c6f might amuse you - how I spent my lunch break
[21:26:02] hey halfak, would you add bernie@imgur.com to the ocs mailing list? She's working with Tim as a data analyst, and it sounds like she'll be handling some logistics and attending the workshop.
[21:26:24] J-Mo, sure dude
[21:27:09] thank you, good sir! Also, halfak, since I've got you: you have any objection to me telling everyone who applied for the workshop that they're accepted?
[21:27:33] J-Mo, no objection.
[21:27:53] Do you know bernie@imgur.com's first name?
[21:27:59] Is it bernie?
[21:28:06] it is indeed
[21:28:13] kk. Just want to send a welcome. :)
[21:28:28] and thank you once again for making my afternoon super efficient
[21:29:59] No prob dude :D
[21:51:27] halfak: I'm having problems using sklearn in python3, so I tested the classifier with python2 with this code http://pastebin.com/QJAtVrfz , it returned 79.4%. what other types of tests do you think would be interesting to run with sklearn?
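(Editor's note: following up on the hdfs dfs -text tip above — unlike -cat, -text decompresses codecs Hadoop knows about, including snappy, so its output can be piped into less or into a script. A small sketch, assuming the data is one JSON object per line and using a placeholder HDFS path rather than the real dataset location.)

```python
# Sketch: peek at snappy-compressed, one-JSON-object-per-line data in HDFS.
# `hdfs dfs -text` decompresses for us; `-cat` would hand back raw snappy bytes.
import json
import subprocess

HDFS_PATH = "/user/halfak/json-snappy/part-*"  # placeholder path, not a real location

proc = subprocess.Popen(
    ["hdfs", "dfs", "-text", HDFS_PATH],
    stdout=subprocess.PIPE,
)

for i, line in enumerate(proc.stdout):
    record = json.loads(line.decode("utf-8"))
    print(sorted(record.keys()))   # just show the field names of each record
    if i >= 9:                     # peek at the first ten records, then stop
        proc.terminate()
        break
```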
[21:51:50] danilo_, in a meeting now. Will take a look in a little bit.
[21:52:11] ok
[22:11:25] lzia: yt?
[22:11:30] you froze on the hangout
[22:46:04] danilo_, I think what we really want is a https://en.wikipedia.org/wiki/Receiver_operating_characteristic
[22:47:57] You could produce one by calling predict_proba()
[22:50:16] ok, I will work on that
[23:11:34] Deskana, looks fine
[23:11:38] although "wmfuuid" is a terrible name
[23:11:49] suggest keeping with AppInstallID, as with the query strings. One fewer thing to remember.
[23:12:00] also it says what it is ;p
[23:23:19] Going to run some errands. Back in a couple of hours.
[23:26:59] I've decided my new tagline is going to be
[23:27:05] Oliver Keyes, Breaker of Things, Fixer of Stuff
[23:46:03] Ironholds: Alright, I'll pass that on. Thanks. :)
[23:46:14] kk
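(Editor's note: on the ROC suggestion above — scikit-learn's roc_curve and roc_auc_score take the true labels plus a continuous score per example, which predict_proba provides for probabilistic classifiers. A plain LinearSVC has no predict_proba, so this generic sketch swaps in a RandomForestClassifier; for an SVM, decision_function or SVC(probability=True) would supply equivalent scores. The data is synthetic, not the revert features from the pastebin.)

```python
# Generic sketch of producing an ROC curve from predicted probabilities;
# synthetic data, and a RandomForest stands in for whatever model is used.
import numpy as np
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
X = rng.rand(2000, 6)
y = (X[:, 0] + 0.5 * rng.rand(2000)) > 0.9   # synthetic "reverted" labels with some signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# predict_proba gives P(reverted) per test example; the ROC curve is computed
# from that score, not from the hard 0/1 predictions.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```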