[03:11:13] Analytics / Refinery: Hive is broken on stat1002 - https://bugzilla.wikimedia.org/70203#c5 (Toby Negrin) Christian -- I ran a hive query and redirected output to file -- thus I thought hive was running :( Totally agree -- Hive is not a production service and there is no expectation of off-hour suppo...
[14:17:21] (PS13) Nuria: Project and language choices component [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741
[14:22:41] happy communist day, staffers!
[14:50:51] (CR) Milimetric: Project and language choices component (5 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741 (owner: Nuria)
[14:51:53] (CR) Nuria: [C: 2 V: 2] Add metric selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/157170 (owner: Milimetric)
[15:09:07] (CR) Nuria: "Agree with your comment, let me make changes to 'defaultSelection' field." (4 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741 (owner: Nuria)
[15:12:27] (PS14) Nuria: Project and language choices component [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741
[15:32:35] yoo Ironholds, you around?
[15:32:48] (hi labor day workforce! :) )
[15:33:05] holaaaa ottomata
[15:33:11] ottomata, I am! Trying to work out how an embedded NUL got in a user agent and what the bloody hell I can do about it
[15:34:16] uh oh
[15:34:24] an actual NULL in the json
[15:34:26] not a string?
[15:34:32] (in hadoop?)
[15:34:42] nope, the sampled logs
[15:34:49] ah ok
[15:34:50] phew
[15:34:50] and, ASCII NUL.
[15:34:54] not an actual NUL
[15:34:57] *NULL
[15:35:27] I'm hanging with my buddy Fabian, who is a data engineer at twitter, and we are going to try some fancy stuff with the cluster
[15:35:30] not entirely sure what yet
[15:35:32] cool!
[15:35:35] but! we will play
[15:35:37] I'm not using it at the mo so feel free
[15:35:37] q for you:
[15:35:49] what's something interesting that you might like us to try to generate?
[15:36:02] hmm. Like, a new dataset or a new UDF or...
[15:36:03] this could potentially be something that uses both webrequest data + mediawiki database stuff
[15:36:09] oooh
[15:36:11] i'm going to try to import some mysql tables maybe...
[15:36:13] in that case, I have a perfect task
[15:36:16] oh ja?
[15:36:29] it's both an interesting problem that includes MW data /and/ a pending glamwiki/NARAA request
[15:36:50] find all the requests that were to images in [category]
[15:36:55] * Ironholds sits back and purses fingers
[15:37:44] hmm, ok.....
[15:37:58] so we need information on uploaded pages?
[15:38:03] would the image page have the category in it?
[15:38:13] i don't know the mediawiki database well
[15:38:16] what table would that be in?
[15:38:24] i'd need category table...aaannnd, what?
[15:38:56] category and page
[15:40:08] and then you'd do something like SELECT page_title FROM page INNER JOIN categorylinks ON cl_from = page_id WHERE cl_to IN ('list of categories go HERE') AND cl_type = 'file'
[15:40:10] ok, what db slave should I connect to?
[15:40:17] analytics-store? Probably the easiest.
[15:40:29] the categories are all on commons if you'd like the actual list for the test :)
[15:40:35] ooh. except..fuck.
[15:40:41] * Ironholds headdesks
[15:40:47] key-value.
[15:40:50] ?
[15:40:58] you wouldn't include files in [subcategory of category].
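
A minimal, runnable sketch of that query, assuming pymysql against the analytics-store replica mentioned above; the host, database name, and category list are placeholders, while page/categorylinks and their columns are the standard MediaWiki schema:

# Sketch: titles of files directly in a set of categories.
# Host, credentials, and the category list are placeholders.
import pymysql

CATEGORIES = ['Some_NARA_category']  # hypothetical; cl_to uses underscores,
                                     # with no "Category:" prefix

conn = pymysql.connect(host='analytics-store.eqiad.wmnet',  # assumed host
                       db='commonswiki',
                       read_default_file='~/.my.cnf')

placeholders = ', '.join(['%s'] * len(CATEGORIES))
sql = ("SELECT page_title "
       "FROM page "
       "INNER JOIN categorylinks ON cl_from = page_id "
       "WHERE cl_to IN (%s) AND cl_type = 'file'" % placeholders)

with conn.cursor() as cur:
    cur.execute(sql, CATEGORIES)
    files = [row[0] for row in cur.fetchall()]
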
[15:41:18] I knew there was a reason I was planning on doing this the awkward way ;p
[15:41:40] (CR) Milimetric: Project and language choices component (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741 (owner: Nuria)
[15:41:41] not following...but ok?
[15:42:03] Ironholds: what's a good smallish but not too small wiki to test queries out with?
[15:42:15] arwiki is always fun
[15:42:19] okay, so categorylinks is [page the link is on] [category] [what type of page the first page is], right?
[15:42:40] so: Barack_Obama's_PageID, Presidents_of_the_United_States, page
[15:43:09] Ironholds: a few more columns: http://www.mediawiki.org/wiki/Manual:Categorylinks_table
[15:43:11] except, if your page is in a subcategory, the tie between subcategory and category is itself in categorylinks.
[15:43:30] yeah, but unimportant ones for this exercise.
[15:43:50] ok
[15:44:01] for identifying "all the files in the category tree with CategoryX as a parent", it's awkward and hellish and requires recursion. Or at least every solution I've seen does.
[15:44:07] So it may not work for this exercise :/
[15:44:08] so in that case type would be...categorylinks
[15:44:09] or something?
[15:44:15] (page type?)
[15:44:22] subcat
[15:44:23] i see it
[15:44:35] yup
[15:44:46] ok, well, we can figure that out as step two
[15:44:59] can we for now use that as a use case, but only deal with ones where type = file?
[15:45:04] leaves of the tree? :)
[15:45:56] totally!
[15:46:08] cl_from is a page_id, and cl_to is a category name????
[15:46:14] yep!
[15:46:14] who created this schema!??!?!
[15:46:19] welcome to MediaWiki: Everything Sucks Ass.
[15:46:46] (CR) Milimetric: [C: 2 V: 2] Project and language choices component [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741 (owner: Nuria)
[15:46:51] You'll laugh, you'll cry, you'll beg otto and christian to get the hadoop cluster running so you never have to deal with MW data again
[15:47:16] haha, you will not escape it
[15:47:20] mw data will just get into the cluster...
[15:47:55] can you transform it so the structure is not so terrible? ;p
[15:48:40] (PS15) Milimetric: Project and language choices component [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741
[15:49:17] maybe? sure? you will certainly have to be involved in that if/when it happens
[15:49:19] (CR) Nuria: [C: 2 V: 2] Add visualizer to coordinate selectors and graphs [analytics/dashiki] - https://gerrit.wikimedia.org/r/156722 (owner: Milimetric)
[15:49:31] but, that is probably not a good idea, as things will just get more confusing...more translation, etc.
[15:49:34] but, who knows!
[15:49:45] anyway, hm, ok so
[15:49:46] i need
[15:49:49] (CR) Milimetric: [C: 2 V: 2] Project and language choices component [analytics/dashiki] - https://gerrit.wikimedia.org/r/156741 (owner: Nuria)
[15:49:50] page and categorylink tables
[15:50:05] then I can join those with webrequest somehow and count number of requests in a category
[15:50:26] wait, tell me again what we want to generate
[15:50:31] number of requests to images in each category?
[15:50:47] so, output would be: category, count
[15:52:26] Ironholds: ^
[15:52:40] file, count, ideally
[15:52:51] I'm not sure how you'd do that, though. I guess uri_path,count
[15:53:15] oh, we just want type=file request counts?
[15:55:17] ideally!
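
The recursive "step two" deferred above, walking the subcategory tree, might look roughly like this; a sketch only, reusing the hypothetical pymysql connection from the previous snippet, with a seen-set because category graphs can contain cycles:

# Sketch of the deferred "step two": collect file titles under a whole
# category tree by recursing through cl_type = 'subcat' rows. A subcategory
# page's page_title (no namespace prefix) is exactly what appears in cl_to,
# so it feeds straight back into the query. On a real replica the title
# columns come back as bytes; decoding is omitted here.
def files_under(conn, category, seen=None):
    seen = set() if seen is None else seen
    if category in seen:  # guard against cycles
        return set()
    seen.add(category)
    files, subcats = set(), []
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_title, cl_type "
            "FROM page INNER JOIN categorylinks ON cl_from = page_id "
            "WHERE cl_to = %s AND cl_type IN ('file', 'subcat')",
            (category,))
        for title, cl_type in cur.fetchall():
            if cl_type == 'file':
                files.add(title)
            else:
                subcats.append(title)
    for sub in subcats:
        files |= files_under(conn, sub, seen)
    return files
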
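And a rough sketch of the counting side that gets worked out next, once page and categorylinks are imported into Hive next to webrequest; the webrequest table layout and the uri_path shape here are assumptions, not the cluster's actual schema, and hive -e just ships the query to the Hive CLI:

# Rough sketch: requests per file for files directly in the chosen
# categories, assuming page and categorylinks have been imported into Hive.
# Table/column names for webrequest and the URL shape are assumptions.
import subprocess

hql = """
SELECT p.page_title, COUNT(*) AS requests
FROM webrequest w
JOIN page p
  ON w.uri_path = CONCAT('/wiki/File:', p.page_title)  -- assumed URL shape
JOIN categorylinks cl
  ON cl.cl_from = p.page_id
WHERE cl.cl_to IN ('Some_NARA_category')  -- hypothetical category
  AND cl.cl_type = 'file'
GROUP BY p.page_title
"""

subprocess.run(['hive', '-e', hql], check=True)
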
[15:56:10] hmm, ok, so, i need to figure out how to associate a webrequest to a page, then page to categorylink where type=file, and count all of those
[15:56:11] ok
[15:59:43] Ironholds: say hi to declerambaul
[15:59:45] :)
[16:00:35] hi declerambaul :)
[16:00:55] hihi
[16:02:56] (PS8) Milimetric: Add metric selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/157170
[16:05:19] Ironholds: until my dying breath I shall push for a normalized, properly pre-aggregated analytics database that makes it easy to answer questions about mw dbs
[16:05:50] milimetric, cool!
[16:07:14] (CR) Milimetric: [C: 2 V: 2] Add metric selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/157170 (owner: Milimetric)
[19:06:12] Ironholds / kevinator: question about what wikis should be analyzed for vital signs
[19:06:16] http://meta.wikimedia.org/w/api.php?action=sitematrix&smsiteprop=url|dbname|code&smstate=all&format=json
[19:06:22] that URL ^^ has all the wikis
[19:06:31] as you can see, some are "private"
[19:06:36] and a lot are "closed"
[19:06:40] http://meta.wikimedia.org/w/api.php?action=sitematrix&smsiteprop=url|dbname|code&smstate=all
[19:06:47] that might be easier to see because it's XML
[19:07:08] and some are "fishbowl"
[19:07:42] http://meta.wikimedia.org/w/api.php?action=help this describes all the states a wiki can be in, search for "sitematrix"
[19:08:04] closed - No write access, full read access
[19:08:04] private - Read and write restricted
[19:08:04] fishbowl - Restricted write access, full read access
[19:08:25] so, I was going to include everything that's not private
[19:12:13] milimetric, gotcha
[19:12:22] you probably don't want to use sitematrix, actually
[19:12:29] I'd recommend using the noc lists, which split it out distinctly
[19:12:40] http://noc.wikimedia.org/conf/
[19:12:49] I wrote a script to gather this - one moment while I resurrect it...
[19:13:22] Ironholds: aye, but we need the more detailed information there
[19:13:31] okay, you want s1-7, minus private, closed, deleted, special and wikimedia
[19:13:32] yeah
[19:13:44] like language name, project name, etc.
[19:13:51] so my suggestion is to combine the noc lists, exclude the aforementioned db list, and then grab the associated sitematrix data
[19:14:05] noc has the metadata about state in a sane form, sitematrix has the metadata about everything else
[19:14:12] it's not ideal but it is resilient to cluster changes
[19:14:31] hm, but we don't need to know db information
[19:14:36] oh, you don't?
[19:14:37] hrm.
[19:14:49] right, because they're always available on labsdb in the same format
[19:14:52] gotcha
[19:15:00] project.labsdb/project_p
[19:15:04] ahh, but the API doesn't include all the states we care about
[19:15:20] because some of the states don't matter from a MW point of view ('what's a "wikimedia"?')
[19:15:34] bleh
[19:15:39] states? projects?
[19:15:48] okay, so, example.
[19:15:58] Do we care about vital signs on the dutch chapter's wiki?
[19:16:31] nope. So find me the special marking nl.wikimedia.org has in sitematrix that identifies it as "something we don't care about" ;p.
[19:16:46] it's write-available, and public, but it's not "production" for R&D purposes.
[19:16:55] right
[19:17:09] the noc lists list it under the appropriate slave, but also list it in "wikimedia", i.e. movement wikis.
[19:17:33] So starting from the noc lists and then excluding those found in some dblists - wikimedia, deleted, special, private, closed - will get you the "production" dbnames.
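
A minimal sketch of that noc + sitematrix combination: the s1-s7 dblists minus the exclusion lists, with sitematrix metadata attached by dbname. The <name>.dblist URL shape under noc.wikimedia.org/conf/ is an assumption here; the sitematrix URL is the one pasted above:

# Sketch of the noc + sitematrix merge: s1-s7 dblists minus the
# private/closed/deleted/special/wikimedia lists, then sitematrix metadata
# attached by dbname. The <name>.dblist URL shape is an assumption.
import requests

NOC = 'http://noc.wikimedia.org/conf/'
SITEMATRIX = ('http://meta.wikimedia.org/w/api.php?action=sitematrix'
              '&smsiteprop=url|dbname|code&smstate=all&format=json')

def dblist(name):
    text = requests.get(NOC + name + '.dblist').text
    return {line.strip() for line in text.splitlines() if line.strip()}

production = set()
for shard in ('s1', 's2', 's3', 's4', 's5', 's6', 's7'):
    production |= dblist(shard)
for excluded in ('private', 'closed', 'deleted', 'special', 'wikimedia'):
    production -= dblist(excluded)

# sitematrix groups sites by language, with 'count' and 'specials' as the
# odd keys out; index the individual site entries by dbname.
matrix = requests.get(SITEMATRIX).json()['sitematrix']
sites_by_dbname = {}
for key, group in matrix.items():
    if key == 'count':
        continue
    sites = group if key == 'specials' else group.get('site', [])
    for site in sites:
        sites_by_dbname[site['dbname']] = site

production_meta = {db: sites_by_dbname.get(db) for db in production}
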
[19:17:43] which we don't need at the end, but can use to match it up with the metadata in sitematrix.
[19:17:52] right
[19:17:54] an inner join, but in python :D
[19:18:03] heh, yeah, the question is more what to include
[19:18:07] ohh
[19:18:09] data hacking is easy
[19:18:15] "all production wikis"?
[19:18:24] right, i believe what you say to be sensible
[19:18:32] i'd like to confirm with kevinator
[19:18:36] gotcha
[19:18:39] kevinator: don't read up, just ping me when you get this
[19:18:48] some chapters have data requests but it's all been read data thus far
[19:19:26] i mean, it's easy enough to change how we do this, it's important to get consensus on where to start though
[19:20:26] yeah
[21:34:25] time to weigh in with a question:
[21:34:50] what if some day the dutch are interested in how many active editors they have on their chapter’s mediawiki
[21:34:51] ?
[21:35:38] I think it’s easier to include non-production wikis in EEVS
[21:35:50] only exclude the ones that are "private"
[21:36:05] milimetric: ^^
[21:36:59] kevinator, why?
[21:37:24] and if some day the dutch are interested in that, we can add it then. But it has no relevance to this actual project.
[21:37:54] the EE in EEVS stands for something specific. Chapter activity is either in-addition-to editing production projects, or not in addition to, and either way it is irrelevant from an EE POV.
[21:38:35] also, I'm not seeing how there's a substantive cost here. I could write a script to merge sitematrix and the NOCs myself in about 10 minutes, and I know because I've done it ;p.
[21:39:31] Conversely, limiting to !private means running kind of a lot of processing power over a large number of projects we will extract little or no value from. That's a big cost, albeit one machines rather than people have.
[21:40:57] BTW is there really a dutch chapter wiki? I’m doing text searches for one in sitematrix
[21:41:40] nl.wikimedia.org
[21:41:44] or nlwikimedia
[21:42:16] we host most of the chapter and user group projects, as well as 'production' and a lot of test sites that can arbitrarily cease to exist but are SULd and public.
[21:42:41] I'm not entirely sure what happens to EEVS when a database it was reading from no longer exists but it's a question we may want to avoid answering ;p
[21:46:40] do you have an estimate of how many wikis are non-production?
[21:46:48] Ironholds: ^^
[21:46:59] I can get one; vun moment.
[21:47:47] kevinator, 272 out of 883 dbs total.
[21:47:50] So, Kind Of A Lot.
[21:48:15] "private" would exclude 30 of them.
[21:48:36] yeah and 128 closed wikis
[21:49:06] yep
[21:50:01] when I wrote the goals for this qtr, I stated 882 wikis because I didn’t know some were closed.
[21:50:15] BTW any way to know which was the latest addition? we’re at 883 now
[21:51:19] Is there a way to programmatically detect which one is added?
[21:55:34] sorry i missed the ping, thanks Oliver
[21:56:48] kevinator, if we use the noc lists as the origin it's automatically updated
[21:56:59] or: not automatically, but good luck getting a db apache doesn't know exists to work :P
[22:01:57] (PS1) Milimetric: Add dashiki config generation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/157761
[22:02:40] nuria__: that last patch is generating the metrics and language / projects json files you should be expecting
[22:02:53] you can use it if you want to get a more complete list and see how the autocomplete behaves, etc.
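
Back on kevinator's question above about programmatically detecting a newly added wiki: with the noc lists as the origin, one hedged option is simply to snapshot the merged dbname set between runs and diff. A sketch, with the same assumed .dblist URL shape as in the merge snippet:

# Sketch: detect newly created (or removed) wikis by diffing the current
# dblist union against a snapshot saved on the previous run.
import json
import requests
from pathlib import Path

NOC = 'http://noc.wikimedia.org/conf/'
SNAPSHOT = Path('dbnames-snapshot.json')  # placeholder path

def dblist(name):  # same assumed URL shape as the previous sketch
    text = requests.get(NOC + name + '.dblist').text
    return {line.strip() for line in text.splitlines() if line.strip()}

current = set()
for shard in ('s1', 's2', 's3', 's4', 's5', 's6', 's7'):
    current |= dblist(shard)

previous = set(json.loads(SNAPSHOT.read_text())) if SNAPSHOT.exists() else set()
print('added:', sorted(current - previous))
print('removed:', sorted(previous - current))

SNAPSHOT.write_text(json.dumps(sorted(current)))
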
[22:03:03] you just run scripts/admin dashiki
[22:03:09] i will not get there today
[22:03:14] i still have no styles
[22:03:21] just interaction
[22:03:23] and it puts available-metrics.json and available-projects.json in static/public
[22:03:26] ok, no prob
[22:03:28] just fyi
[22:03:36] i'm gonna be done for the night, I think
[22:03:41] But good work, we will start there on next pass
[22:03:44] yep
[22:03:58] i'll take a look at your patch tomorrow morning and ping you, or do styling if you're not around
[22:04:04] ok, I will submit my patch by EOD
[22:04:12] ciao
[22:04:17] have a good night!
[22:06:15] ciao i am going to take a walk and keep working on this after that
[22:54:00] (PS1) Yuvipanda: Don't re-use sessions across celery runs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157762
[22:56:07] (CR) Yuvipanda: [C: 2] Don't re-use sessions across celery runs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157762 (owner: Yuvipanda)
[22:56:13] (Merged) jenkins-bot: Don't re-use sessions across celery runs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157762 (owner: Yuvipanda)
[23:13:25] (PS1) Yuvipanda: Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763
[23:13:30] (CR) jenkins-bot: [V: -1] Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763 (owner: Yuvipanda)
[23:14:28] (PS2) Yuvipanda: Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763
[23:14:33] (CR) jenkins-bot: [V: -1] Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763 (owner: Yuvipanda)
[23:15:18] (PS3) Yuvipanda: Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763
[23:17:08] (CR) Yuvipanda: [C: 2] Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763 (owner: Yuvipanda)
[23:17:14] (Merged) jenkins-bot: Handle MySQL error about wrong name in SELECT [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157763 (owner: Yuvipanda)
[23:20:04] (PS1) Yuvipanda: Surface all internal errors we don't handle [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157764
[23:20:21] (CR) Yuvipanda: [C: 2] Surface all internal errors we don't handle [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157764 (owner: Yuvipanda)
[23:20:31] (Merged) jenkins-bot: Surface all internal errors we don't handle [analytics/quarry/web] - https://gerrit.wikimedia.org/r/157764 (owner: Yuvipanda)