[13:50:02] Good morning Science people!
[13:50:26] (depending on your longitude)
[14:00:57] morning!
[14:02:39] o/ ottomata
[14:02:41] :)
[14:03:06] I'm running a hadoop job that is making API calls and it seems to be going OK :)
[14:03:20] I really wish I could read the logs of a running job.
[14:03:32] I wonder if spark'll help with that.
[14:03:50] halfak on our cluster?
[14:04:43] ottomata, negative
[14:04:46] Altiscale
[14:05:03] Seems to be a general problem with hadoop though.
[14:05:12] I suppose I could go log into the nodes though, right?
[14:06:31] ja, you can get it from nodes if you tunnel to them
[14:07:04] but, you have to get them from individual containers :/
[14:07:05] i think...
[14:07:06] yeah
[14:07:50] hue helps a bit
[14:07:55] don't need tunnels as much
[14:08:09] here is a map from one of bob's stream jobs
[14:08:10] https://hue.wikimedia.org/jobbrowser/jobs/job_1424966181866_89807/tasks/task_1424966181866_89807_m_003195/attempts/attempt_1424966181866_89807_m_003195_0/logs
[14:09:07] ottomata, should hue be accessible externally?
[14:09:19] Oh wait... just got a login
[14:10:00] What account do I use?
[14:10:14] It seems that my labs account doesn't work
[14:11:33] shell username
[14:11:41] ldap (labs/wikitech) pw
[14:12:56] halfak: you get in?
[14:13:27] Hmm.. Tried that. Trying again.
[14:14:15] Yup. Confirmed via copy-paste that the same creds that work on wikitech do not work on hue.
[14:14:21] ottomata, ^
[14:14:28] Yup == Nope
[14:14:31] :S
[14:16:03] username: halfak
[14:16:03] ?
[14:16:24] hmmm
[14:16:25] try again
[14:16:25] Bah! It was the capitalization
[14:16:28] ah :)
[14:16:36] I log in as "Halfak" on wikitech
[14:16:37] yeah, it authenticates against your shell username
[14:16:38] yeah
[14:16:40] But "halfak" on hue
[14:16:53] yeah, it is the same one you'd see in production node CLI prompts
[14:16:55] that one
[14:17:32] Was this supposed to be logging for a live job? It looks like it was killed.
[14:19:42] ottomata, ^
[14:20:01] that task is probably done halfak
[14:20:14] will find another
[14:20:20] here's bob's job: http://localhost:8888/jobbrowser/jobs/application_1424966181866_89807
[14:20:20] Oh I see. It *was* a live task. No worries.
[14:20:37] here's another mapper:
[14:20:37] http://localhost:8888/jobbrowser/jobs/job_1424966181866_89807/tasks/task_1424966181866_89807_m_003329/attempts/attempt_1424966181866_89807_m_003329_0/logs
[14:20:41] I just wanted to make sure that I understood that live task logs were expected to be available.
[14:20:56] Any way I can get you to talk to altiscale about setting something like this up?
[14:21:05] suuuuure :)
[14:21:12] i mean, it's hue
[14:21:16] easy to set up:
[14:21:20] I am often blocked for up to 20 minutes/run waiting for a job to fully crash so that I can find out why
[14:21:26] http://gethue.com/
[14:21:37] And you know how it goes. I'll run a job 10 times before it works.
[14:21:46] So, it takes 3 hours just to debug a new job.
[14:23:15] yeah i know
[14:25:48] ottomata, I'd like to file a ticket with altiscale requesting hue. Anything you suggest I say other than "I want to debug faster. Please set up hue."?
[14:33:29] say, hue seems like an easy way to look at some logs of running jobs
[14:33:38] that is what I really want
[14:33:46] hue, or some other easy way to do that please
[14:36:24] ottomata, thanks
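(Editor's note: while waiting on a Hue install, one stopgap for the "read the logs of a running job" problem above is the YARN ResourceManager's REST API, which lists running applications and their ApplicationMaster tracking URLs -- roughly what Hue's job browser wraps. This is a hedged sketch, not anything from the log: the ResourceManager address is a placeholder for whatever the Altiscale cluster exposes.)

```python
# Hedged sketch: list your RUNNING YARN applications and their tracking URLs
# via the ResourceManager REST API. The host/port below is a placeholder.
import requests

RM = "http://resourcemanager.example.org:8088"  # placeholder RM address

resp = requests.get(RM + "/ws/v1/cluster/apps",
                    params={"states": "RUNNING", "user": "halfak"})
apps = (resp.json().get("apps") or {}).get("app", [])
for app in apps:
    # trackingUrl points at the ApplicationMaster UI, which links through
    # to per-container task logs -- the same pages Hue surfaces.
    print(app["id"], app["progress"], app["trackingUrl"])
```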
[14:52:18] Ironholds: Spotify ads are really harshing my Portishead chill
[15:52:30] hi everyone.
[16:03:17] * guillom waves.
[16:03:34] o/
[16:04:22] morning
[16:04:23] how goes?
[16:04:58] uhm
[16:04:59] ottomata, ?
[16:05:11] do you know why a job would just up and drop all of the mapping it had done and restart wordlessly?
[16:05:19] 2015-04-17 14:11:35,958 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 1332767.52 sec
[16:05:19] 2015-04-17 14:12:09,606 Stage-1 map = 0%, reduce = 0%
[16:05:19] 2015-04-17 14:13:09,673 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 104.29 sec
[16:06:38] preemption?
[16:06:45] did those maps get preempted?
[16:06:56] which job Ironholds?
[16:08:02] I don't know what preemption is
[16:08:22] and the job ID is now out of my screen's buffer, because welcome to Hive, we have terrible logging :/
[16:08:28] but I can give you the query?
[16:15:36] Ironholds: goes well. how are you?
[16:16:01] eh, same as always.
[16:16:23] o/ Nettrom
[16:16:35] g'morning leila
[16:16:52] mornin' halfak.
[16:17:48] Wooo! It looks like my cleaned-up diffs are ready for the next stage *and* the form builder now has sane code. :)
[16:19:00] nice :D
[18:08:09] ottomata, ^ ;p
[18:29:14] Ironholds: do you ever look at this page?
[18:29:15] https://yarn.wikimedia.org/cluster/scheduler
[18:29:23] this one is yours
[18:29:23] https://yarn.wikimedia.org/cluster/app/application_1424966181866_89048
[18:29:41] i see it started a new application master for you, so the whole job did start over
[18:30:51] but ja, that is a really big query oliver
[18:31:00] well, yes
[18:31:01] it isn't currently running any mappers
[18:31:02] https://yarn.wikimedia.org/proxy/application_1424966181866_89048/mapreduce/job/job_1424966181866_89048
[18:31:11] whatcha doing?
[18:31:27] fulfilling a research request for per-second (mobile, desktop) pageviews over as much data as we have
[18:31:38] can you sample?
[18:31:53] oh i saw that email thread
[18:32:13] or, maybe run them in batches? like daily?
[18:32:17] and do it iteratively?
[18:32:29] query?
[18:32:51] you are basically saying to the cluster: I NEED ALL THE RESOURCES GIMME
[18:32:58] and the cluster is like: Uhhhh, maybe a little bit at a time, i dunno
[18:33:52] gotcha
[18:33:56] daily it is, and I'll just append
[18:34:08] hive -f daily_pageviews.hql >> pageviews
[18:34:09] ja, you are grouping by second, right?
[18:34:12] da
[18:34:18] ja, you could even do by hour then
[18:34:26] might be even easier to recover from then
[18:34:30] lots of small jobs, ja?
[18:34:33] well, small-ish
[18:34:38] if I wanted to rewrite the query and trigger it 24*60 times.
[18:35:41] running daily and we'll see what happens. Thanks :)
[18:35:56] wouldn't be hard, just submit it in a loop, but ok!
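(Editor's note: a minimal sketch of the "submit it in a loop" approach ottomata suggests above, wrapping the `hive -f daily_pageviews.hql >> pageviews` command from the log. It assumes the query file has been parameterized with hivevars named year/month/day; those variable names and the date range are illustrative, not from the log.)

```python
# Submit one small Hive job per day and append its output, instead of one
# giant query over all the data. Assumes `hive` is on PATH and that
# daily_pageviews.hql reads ${hivevar:year}/${hivevar:month}/${hivevar:day}.
import subprocess
from datetime import date, timedelta

day, end = date(2015, 3, 1), date(2015, 4, 17)  # example range
while day <= end:
    with open("pageviews", "ab") as out:  # mirrors `>> pageviews`
        subprocess.check_call(
            ["hive",
             "--hivevar", "year=%d" % day.year,
             "--hivevar", "month=%d" % day.month,
             "--hivevar", "day=%d" % day.day,
             "-f", "daily_pageviews.hql"],
            stdout=out)
    day += timedelta(days=1)
```

A nice side effect of batching this way is the recovery property mentioned above: if one day's job is preempted or fails, only that day needs to be re-run.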
[21:39:13] halfak: I'm reading the GroupLens piece on 'misalignment between supply and demand'...
[21:39:50] * halfak waits for thoughts
[21:39:52] and I wonder: have you or anyone you know of done work on how to calculate the scope of an article?
[21:40:08] Sort of.
[21:40:19] like, 'NATO' and "Vietnam War" are crappy and have high reader demand.
[21:40:20] So there's this work that B. Hecht did to compare articles across languages.
[21:40:53] so if only people would have taken the time to write two FAs about battleships and made those two FA instead... except that the scope makes those two really really hard articles to do well.
[21:40:58] One of the problems is that a topic that is covered by one article in one wiki might be broken into many articles in another.
[21:41:23] ragesoss_, I think I might be thinking of another definition of "scope"
[21:41:53] what I mean is, 'is this a really broad, high-level topic, or a really narrow, specific one?'
[21:41:57] ragesoss, are you getting at some measure of "FA potential"?
[21:42:16] ragesoss, gotcha.
[21:42:44] Hecht's work builds a sort of hierarchy, but it didn't give you an indication of absolute depth.
[21:43:42] I wonder if one could do a similar thing for 'importance' as the Wiki-Class does for class.
[21:43:56] ragesoss, I think we can do that quite well.
[21:44:06] https://meta.wikimedia.org/wiki/Research:Measuring_article_importance
[21:45:32] halfak: yeah, but by 'importance' I meant something closer to the en-wiki 1.0 scale version of it, which, if not 'scope' as I defined it above, is at least closer to it.
[21:46:49] (I think the term "importance" fits better with something like what you do on that page.)
[21:48:28] (And that kind of 'importance' is probably what's most relevant in terms of misalignment of effort. It just got me thinking about the scope issue, because it's one of the things that matters a lot for the education program... finding a topic of narrow enough scope that a student can be expected to develop a good general overview of it within the timeframe of the course.)
[21:50:47] ragesoss, yeah. that's what I was using for "importance" -- the WP 1.0 scale.
[21:51:25] It turns out that view rates correlate strongly with WikiProject-assessed importance.
[21:52:13] However, inlink structure seems to reflect encyclopedic notions of "importance" better than view rate.
[21:52:27] E.g. "Breaking Bad" gets 10 times as many views as "Chemistry"
[21:52:35] inlink structure as in, inbound links and redirects?
[21:52:42] "Chemistry" has 10 times as many incoming links as "Breaking Bad"
[21:52:50] ragesoss, that's right
[21:53:06] We can do some more interesting things like PageRank. I haven't gotten there yet.
[21:53:48] This is super exciting. I'm trying to work out just how I'd like to build each of these into the course platform.
[21:55:27] I'm really amazed at how far things have come just in the last year in terms of actionable analytics for trying to solve our hard mission problems.
[21:56:37] btw regarding the misalignment https://upload.wikimedia.org/wikipedia/commons/8/8c/Wikipedia%E2%80%99s_poor_treatment_of_its_most_important_articles.pdf
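(Editor's note: halfak mentions PageRank above as a possible refinement over raw inlink counts. Below is a toy power-iteration sketch of the algorithm on a small, invented article link graph -- purely illustrative of the technique, not the project's actual implementation.)

```python
# Toy PageRank by power iteration. `links` maps page -> list of outlinks.
def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {t for outs in links.values() for t in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # distribute this page's rank across its outlinks
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # a dangling page spreads its rank mass uniformly
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Tiny made-up graph: more pages link to "Chemistry" than to "Breaking Bad",
# so "Chemistry" ranks higher regardless of view counts.
print(pagerank({"Breaking Bad": ["Chemistry"],
                "Methamphetamine": ["Chemistry"],
                "Chemistry": ["Methamphetamine"]}))
```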
[21:57:27] ragesoss, :)!!! Now my question is, how would you like to access this data?
[21:57:52] An API that lets you ask, "How important is the article with this <id>?"
[21:58:01] halfak: through a fast API that I can query by, e.g., revision_id or page_id or global_id
[21:58:10] Or maybe you'd like a query service to return top-importance articles in a category.
[21:58:20] global_id?
[21:58:27] or user_id
[21:58:53] but we're doing things with OAuth, so we already deal with global_id and potentially want to get cross-wiki info about a given user.
[21:59:07] as editor, i would *love* to have a gadget that adds an importance score (or just avg pageview number) to each entry on my watchlist
[21:59:12] Commons + en.wiki, at the minimum, but potentially other wikis as well.
[21:59:22] Gotcha. So you might want to ask, "How important are the articles this user has been contributing to?"
[21:59:50] HaeB, it seems like this is do-able in the short term if you don't mind data being out of date for new pages.
[21:59:51] halfak: have you seen our dashboard system? http://dashboard.wikiedu.org/courses/University_of_North_Alabama/480_Geotectonics_(Spring_2015)/articles
[22:00:30] ragesoss, cool! I want to give you things to plug into that :)
[22:00:54] ragesoss, did I ever talk to you about https://meta.wikimedia.org/wiki/Research:WikiCredit?
[22:01:05] halfak: no. I was browsing it, though.
[22:01:10] I'm not too excited about designing yet another dashboard.
[22:01:14] halfak: that should be fine... might actually follow up with you on that
[22:01:17] I really just want to get the measurement out to people.
[22:01:37] HaeB cool! I'll want something like that for building the WikiCredit system anyway.
[22:01:43] So, it's on my to-do list.
[22:02:15] HaeB, could you drop me a note somehow? Maybe on the talk page for https://meta.wikimedia.org/wiki/Research:Measuring_article_importance?
[22:02:16] halfak: we're hoping that our dashboard can also be a generally useful tool in the ecosystem, for people running any sort of coordinated editing project. I like building this dashboard!
[22:02:30] I'd like to think about that more, and talk pages are good reminders and places to drop mocks.
[22:02:53] halfak: OK! might get to it next week
[22:03:03] ragesoss, +1. I think we need a nice generalizable system and I think the design of this looks pretty awesome.
[22:03:14] How might I tie into it?
[22:04:10] halfak: If there are APIs for these things, we'll start incorporating them in the second half of the year.
[22:04:29] First, we need to get the system to the point where it runs without depending on the EducationProgram extension.
[22:04:41] we'll have that ready in time for the fall term.
[22:05:32] halfak: if you want to see under the hood, https://github.com/WikiEducationFoundation/WikiEduDashboard
[22:06:02] halfak: I also have a few instances of it running on labs, one of which is pointed to sv.wiki
[22:07:11] halfak: have you done anything with the value of images?
[22:07:23] ragesoss, gotta get back to other stuff, but this is very cool and something I'd love to provide some APIs for.
[22:07:34] ragesoss, negative, but I bet there's something reasonable we could do.
[22:07:58] e.g. inclusions in articles, weighted by the importance of those articles
[22:08:08] that's something I started poking at... as an easy starting point, just getting globalusage counts for mainspace.
[22:08:13] halfak: exactly.
[22:08:13] The hard thing there is: no ground truth
[22:09:31] yeah. survival over many edits and many views is maybe the best you can do.
[22:09:55] ragesoss, I'll get back to you with some ideas for what I think the API might look like as soon as I have a chance to sit down with it.
[22:10:21] halfak: awesome.
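(Editor's note: a minimal sketch of the "inclusions weighted by article importance" idea for valuing images, discussed just above. Both inputs are stand-ins invented for illustration: an image-to-articles mapping such as one pulled from the globalusage table, and any per-article importance score -- views, inlinks, PageRank.)

```python
# Value an image as the summed importance of the articles that include it.
# All data below is made up for the example.
def image_value(usage, importance):
    return {image: sum(importance.get(article, 0.0) for article in articles)
            for image, articles in usage.items()}

usage = {"Mona_Lisa.jpg": ["Mona Lisa", "Leonardo da Vinci"],
         "Cat_selfie.jpg": ["Selfie"]}
importance = {"Mona Lisa": 0.90, "Leonardo da Vinci": 0.95, "Selfie": 0.20}
print(image_value(usage, importance))  # Mona_Lisa.jpg: 1.85, Cat_selfie.jpg: 0.2
```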
[22:32:04] halfak: hey; I need to extract URLs included in ref tags and count them by domain. Will I save time if I try to use some of https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia , or am I better off doing something simpler?
[22:33:04] guillom, do you want them historically or on the most recent revision?
[22:33:12] halfak: most recent
[22:33:13] Also, is it critical that they are in <ref> tags?
[22:33:40] halfak: Probably. Ideally I want only those in ref tags, but I don't know yet if it makes a significant difference
[22:34:13] guillom, if you can live without ref tags, there's a table in the db
[22:34:17] "externallinks"
[22:34:41] if you need them to be inside of <ref> tags, I'm planning to *start* some work this weekend that will get a historical account of all <ref> tags.
[22:35:13] I could dump out the most recent tag situation while I am at it, but I can't guarantee being done before Monday.
[22:35:29] halfak: Yeah; my problem is: either I only want those in ref tags, or I need to prove that non-ref links are not significant (which I doubt). In any case, just using the DB won't suffice :)
[22:35:53] Gotcha.
[22:36:21] So, I think you can write a ~30-line python script to extract this that will run on stat1003 in about 10 hours -- maybe less.
[22:36:30] I can help.
[22:36:55] Thanks; I'll look into that then.
[22:37:39] halfak: I have no idea what stat1003 is, though. Is that like Labs?
[22:37:54] heh. it is one of the computing machines we use internally.
[22:38:10] oh. In order not to have to do this on my laptop?
[22:38:17] Indeed.
[22:38:20] 16 cores of SPEEED
[22:38:22] :)
[22:38:26] I see :) Thanks
[22:38:29] Soon, we'll be doing this in hadoop.
[22:38:41] So, when you query a DB, how do you do that?
[22:38:52] Labs?
[22:39:38] So far I've been using Labs. Because I needed up-to-date results (for MrMetadata) or because it was small queries (either with a one-liner, or with Quarry).
[22:40:10] I haven't needed to learn a ton of SQL until now.
[22:40:41] I know enough to be dangerous, and I know little enough to be dangerous :p
[22:41:02] Anyway; thank you for the pointers!
[22:41:03] guillom, how are your python chops?
[22:41:23] halfak: barely better; self-taught in a week.
[22:42:14] Ask me about microlithography masks or how to do nanopillars, though, and I can do it with my eyes (almost) closed!
[22:42:28] Gotcha. Could you drop an email to analytics-l with what you are looking for so that I can reference it later?
[22:42:29] But I'm a quick study, too.
[22:42:36] * halfak doesn't know any of those words.
[22:42:49] :)
[22:43:52] halfak: I'll look into it and will reach out if I get stuck. And if I get it to work, you can look at my code and be horrified ( http://xkcd.com/1513/ )
[22:44:01] woot :)
[22:44:36] * guillom should wrap up his desk and go catch that bus.
[22:44:37] bbl
[22:44:39] Thank you!
[22:49:34] o/
[22:59:57] There goes the battery.
[23:00:01] Have a good weekend folks
[23:00:01] o/
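(Editor's note: for reference, a minimal sketch of the ~30-line script halfak describes at 22:36 -- extract URLs from <ref> tags in wikitext and tally them by domain. The regexes here are rough approximations that miss edge cases such as self-closing refs, nested tags, and templated citations; a real wikitext parser would be more robust.)

```python
# Count cited domains: pull URLs out of <ref>...</ref> bodies in wikitext.
import re
import sys
from collections import Counter
from urllib.parse import urlparse  # the urlparse module on the Python 2 of 2015

REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)
URL_RE = re.compile(r"https?://[^\s|\]}<>\"']+")

def domains_in_refs(wikitext):
    """Return a Counter of domain -> number of ref-tag URLs citing it."""
    counts = Counter()
    for ref_body in REF_RE.findall(wikitext):
        for url in URL_RE.findall(ref_body):
            counts[urlparse(url).netloc.lower()] += 1
    return counts

if __name__ == "__main__":
    # e.g. pipe page wikitext in: python count_ref_domains.py < page.wikitext
    for domain, n in domains_in_refs(sys.stdin.read()).most_common(20):
        print(domain, n)
```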