[08:08:01] (PS1) Ladsgroup: DB_SLAVE -> DB_REPLICA [extensions/ORES] - https://gerrit.wikimedia.org/r/311221
[09:08:16] (CR) Aaron Schulz: [C: 032] DB_SLAVE -> DB_REPLICA [extensions/ORES] - https://gerrit.wikimedia.org/r/311221 (owner: Ladsgroup)
[09:09:29] (Merged) jenkins-bot: DB_SLAVE -> DB_REPLICA [extensions/ORES] - https://gerrit.wikimedia.org/r/311221 (owner: Ladsgroup)
[12:35:08] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, User-Ladsgroup: Embed machine readable ores scores as data on pages where ORES scores things - https://phabricator.wikimedia.org/T143611#2645542 (Ladsgroup) Okay, I looked into this. It seems hooks are not good enough to add attributes to rows...
[14:43:26] Revision-Scoring-As-A-Service, MediaWiki-Special-pages, MediaWiki-extensions-ORES, User-Ladsgroup: Filter on user contribs has nowrap, causing issues - https://phabricator.wikimedia.org/T143518#2645619 (matmarex) Open -> Resolved
[15:46:12] o/
[15:46:16] Hey folks!
[15:46:22] Who wants to hack some AI stuff?
[15:46:30] I just finished up the work on cross-validation.
[15:46:34] Now I think it's time to try to see if we can get the PCFG stuff working.
[15:49:40] Checking on the monthly article quality dataset
[16:01:22] hello halfak, can you explain more what you want to do with the article quality dataset?
[16:02:47] GhassanMas, right now, it's intended to be loaded by Elasticsearch (the engine behind Wikipedia's search) so that it can be used to re-rank results and provide additional information on the search page.
[16:03:11] But I'm working on a dataset that will show quality changes over time. I'll be using that dataset to do some analyses of quality changes over time in Wikipedia.
[16:03:42] I think it will be very interesting to split the articles by major subject area and see where the quality growth rate was different.
[16:04:28] One thing I have been thinking about doing is intersecting view rates with quality to see what the average quality of a pageview has been over time.
[16:06:02] that's interesting, halfak
[16:06:48] so by the end you're supposed to get the rate of change of articles, e.g. positive, negative, constant, and then sort them by subject
[16:08:26] Platonides, as it stands, I should be able to do this in English, French and Russian Wikipedias.
[16:08:35] It would be great if we could get a few more article quality models working.
[16:09:20] what about Spanish?
[16:09:29] GhassanMas, yeah. That's roughly the idea. We'll essentially get a graph where the X axis is calendar month and the Y axis is quality level (or quality weighted by views).
[16:09:37] We can split this graph by subject area.
[16:10:41] * Platonides suspects both views and quality will drastically rise when $EXTERNALEVENT makes the article popular
[16:12:20] Platonides, yeah. That'll be an interesting issue to account for. At first, I think I'll just be looking at a single view rate measurement. As a second step, we might compare view rates month-to-month so we can account for sudden shifts in popularity.
[16:12:22] We == me
[16:12:24] :)
[16:12:35] The academic "we"
[16:12:45] But if I get some collaborators, it will really be "we" :D
[16:14:11] :)
[16:14:37] halfak: what all will it take to set up a platform where view rate vs. quality can be viewed/stored?
[16:15:02] codezee, we're working to load a quality dataset into WMF Labs right now.
[16:15:25] It is ~5M rows for English Wikipedia, but it only contains the most recent quality assessments.
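A minimal sketch of the "quality weighted by views" idea above — computing the average quality of a pageview per calendar month. The input file names, their three-column TSV layout, and the class-to-ordinal weights are assumptions for illustration, not the project's actual formats:

```python
# Sketch: view-weighted average quality per month (hypothetical inputs).
# Assumed layouts:
#   quality_by_month.tsv: page_id <tab> YYYY-MM <tab> predicted_class
#   views_by_month.tsv:   page_id <tab> YYYY-MM <tab> total_views
import csv
from collections import defaultdict

# Illustrative ordinal weights for the enwiki assessment classes.
CLASS_WEIGHT = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

quality = {}  # (page_id, month) -> ordinal quality
with open("quality_by_month.tsv") as f:
    for page_id, month, cls in csv.reader(f, delimiter="\t"):
        quality[(page_id, month)] = CLASS_WEIGHT[cls]

weighted = defaultdict(float)  # month -> sum(views * quality)
views = defaultdict(int)       # month -> sum(views)
with open("views_by_month.tsv") as f:
    for page_id, month, count in csv.reader(f, delimiter="\t"):
        if (page_id, month) in quality:
            weighted[month] += int(count) * quality[(page_id, month)]
            views[month] += int(count)

for month in sorted(views):
    print(month, round(weighted[month] / views[month], 3), sep="\t")
```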
[16:15:53] I'm working to load in a ~400-500M row dataset that will have monthly assessments for English Wikipedia, but the data is still being generated.
[16:16:41] If I can get this finished and loaded, I'll index the table by page_id and month so that it's somewhat tractable to query for individual pages within a timespan.
[16:17:27] Getting views is not something I've looked at yet. But it should be true that a table with monthly view rates would be the same *size* as the table with monthly quality assessments.
[16:18:11] halfak: is there something in place recording monthly page views?
[16:19:01] codezee, no, but we have hourly counts, so the hard part is rolling them up to monthly totals in an efficient way.
[16:19:48] halfak: where are the hourly counts stored?
[16:20:21] Some docs on storage and format: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
[16:20:35] There are some old formats to consider and deal with.
[16:20:45] And they only go back to 2007.
[16:23:41] why would it be hard?
[16:24:23] LOTS of data to process.
[16:24:44] Also, the old data doesn't filter out web crawlers effectively.
[16:24:53] that's not hard, only slow
[16:25:11] So we'll have sudden shifts in "view rates" that do not correspond to real shifts.
[16:25:20] We could just write that off as a limitation though.
[16:25:42] Platonides, I find that large-scale data analysis is both hard (for its own reasons) and slow.
[16:26:08] E.g. gotta figure out a method to stream process because in-memory operations are impossible.
[16:27:17] for each line prepare a SQL statement
[16:27:27] when you have N statements, commit to the db
[16:28:44] Platonides, oh my. Writing directly to the DB == mega slow
[16:28:58] "when you have N statements,"
[16:29:00] I'd rather append to an output file for post-processing.
[16:29:03] you'll need to group them
[16:29:10] That way we can compress as we write.
[16:29:15] otherwise it'll be crazy slow
[16:29:20] Then secondarily, we can load a DB with all or a subset of rows.
[16:29:44] Even batch loading the DB would be a major bottleneck.
[16:30:00] Also, the file is portable, so I want that first.
[16:30:07] o/ sabya
[16:30:15] o/ halfak
[16:30:16] it's a crazy format :P
[16:31:38] Platonides, all the cool kids are using Hadoop these days for massive datasets.
[16:32:14] Hadoop == pain in the ass & raw speed.
[16:33:27] * Platonides doesn't speak hadoop
[16:34:10] Platonides, do you speak stdin and stdout?
[16:34:47] only read and write them ;)
[16:35:22] ha! Hadoop has a streaming interface that allows you to write mappers and reducers that read and write stdin/stdout.
[16:35:32] I use that a lot.
[16:35:51] E.g., I'll grep something in my mapper and use a python script as my reducer.
[16:36:03] :O
[16:36:10] is that efficient?
[16:36:31] Platonides, yup. It's on par with Java.
[16:36:38] nice
[16:36:44] Especially when I can get a core util like grep in there :)
[16:36:48] Then it's WAY faster :D
[16:38:44] :D
[16:40:14] halfak: so I think the immediate goal would be to collect all pageviews data available in a standard format somewhere?
[16:40:29] codezee, +1
[16:40:48] There's a library I started working on where I want to capture some nice, efficient pageviews processing code.
[16:41:01] https://github.com/mediawiki-utilities/python-mwviews
[16:41:09] Currently, it's only a wrapper for the pageviews API.
[16:41:23] But I think we want some powerful code for processing the dump files too.
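To make the grep-mapper / Python-reducer pattern described above concrete, here is a rough sketch. The jar path, the input/output paths, and the assumption that pagecount lines look like `project title count bytes` are placeholders, not a tested job:

```python
#!/usr/bin/env python3
# reducer.py -- sums hourly pagecount lines into per-title totals.
# Hadoop streaming sorts the mapper output, so lines for the same
# (project, title) arrive on stdin in contiguous runs.
#
# Illustrative invocation (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /pagecounts/2016-09/* -output /tmp/enwiki_totals \
#       -mapper "grep -E '^en '" -reducer "python3 reducer.py" -files reducer.py
import sys

current_key, total = None, 0
for line in sys.stdin:
    parts = line.split()
    if len(parts) < 3:
        continue  # skip malformed lines
    key, count = (parts[0], parts[1]), int(parts[2])
    if key != current_key:
        if current_key is not None:
            print(current_key[0], current_key[1], total, sep="\t")
        current_key, total = key, 0
    total += count

if current_key is not None:
    print(current_key[0], current_key[1], total, sep="\t")
```

Because the shuffle sorts the mapper output, the reducer only ever compares adjacent keys, which is why such a simple script can keep up with the data volume.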
[16:45:50] Revision-Scoring-As-A-Service, revscoring: Implement PCFG features - https://phabricator.wikimedia.org/T144636#2645701 (Halfak) I've been filing some bugs in `bllipparser`. See [#50 -- Large output dump to stderr when parsing a "word" with a space in it](https://github.com/BLLIP/bllip-parser/issues/50)...
[16:49:14] halfak: anything i can help with?
[17:12:33] sabya, sorry was AFK for a bit.
[17:12:35] * halfak looks.
[17:13:06] Oh! I have an idea. How about we try applying your hash vector work to the article quality models?
[17:13:35] See https://github.com/wiki-ai/wikiclass/blob/master/datasets/enwiki.rev_wp10.nettrom_30k.tsv
[17:14:01] This is the dataset we use to train the article quality models. I can get you a copy of the TSV with the basic features extracted.
[17:15:22] halfak, ok
[17:15:45] If you're kind of tired of this kind of work, we can definitely talk about other stuff too.
[17:17:16] I was AFK, I work in retail...
[17:17:39] curious what else do you have
[17:19:00] sabya, we've been talking about engineering some efficient ways to process pageview dumps.
[17:19:18] It might be nice to build up a multiprocessing strategy around that like the one in mwxml.
[17:19:54] E.g. http://pythonhosted.org/mwxml/map.html#mwxml.map
[17:20:14] what's a pageview dump? you mean processing http access logs?
[17:20:28] sabya, see https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
[17:20:42] Right now, the dumps are hourly, they have holes in them, etc.
[17:21:12] So a dump processing utility would do interesting things like interpolating empty periods and processing different file formats fluently.
[17:22:27] * sabya opens the links
[17:24:52] halfak, does this table contain the page ID and the class corresponding to that page?
[17:24:59] i meant this: enwiki.rev_wp10.nettrom_30k.tsv
[17:25:41] GhassanMas, that's right.
[17:28:48] halfak: so what exactly are we trying to solve?
[17:29:26] sabya, for views, I want to be able to easily produce a dataset that contains (<page_id>, <month>, <total views>)
[17:29:53] <halfak> I'd like to do so by doing multiprocessing over the dump files.
[17:30:16] <halfak> We might want to provide some order guarantees, but that might end up being impractical.
[17:31:23] <sabya> got it
[17:31:49] <sabya> any approaches you have in mind?
[17:34:02] <halfak> So, the mwxml.map() function isn't a true map() since the function provided is expected to `yield`
[17:34:13] <halfak> but that gives a lot of power for stream processing. I'm imagining the same thing.
[17:34:27] <halfak> Except we might want to provide an hour-file processing map() function.
[17:35:35] <halfak> But we might want to provide iterators that are bigger than that. E.g. a day-mapper that would allow aggregations at that level.
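A rough sketch of what the hour-file map() described above could look like, following the mwxml.map() pattern of yielding processors. None of this exists in python-mwviews; the function names, signatures, and the simple whitespace line format are assumptions for illustration:

```python
# Sketch of an mwxml.map()-style helper for hourly pagecount files.
# Hypothetical API: process_hour(path) is a generator; pagecounts_map()
# fans hour files out to worker processes and re-yields their output.
import bz2, gzip
from multiprocessing import Pool

def open_dump(path):
    """Open plain, .gz, or .bz2 hour files transparently."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")

def _run(args):
    process_hour, path = args
    return list(process_hour(path))  # materialize so results can cross processes

def pagecounts_map(process_hour, paths, processes=4):
    """Apply a yielding process_hour(path) to many hour files in parallel."""
    with Pool(processes) as pool:
        for rows in pool.imap_unordered(_run, [(process_hour, p) for p in paths]):
            yield from rows

# Example processor: (title, views) pairs for English Wikipedia.
def enwiki_views(path):
    with open_dump(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "en":
                yield parts[1], int(parts[2])

# e.g. for title, views in pagecounts_map(enwiki_views, glob("/srv/dumps/*.gz")): ...
```

Materializing each hour's output inside the worker gives up true per-file streaming but keeps the sketch simple; order guarantees, as noted above, are not attempted.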
[17:38:07] <sabya> ok
[17:47:18] <GhassanMas> changing place...
[17:48:49] <halfak> sabya, I think a good first step is to write up a proposal for how we could process these files efficiently using the single-powerful-server pattern.
[17:51:27] <sabya> makes sense, questions: 1) where are these dumps generated? 2) where are they supposed to be consumed from? 3) which projects currently process them? 4) where is the processed data stored?
[17:51:48] <sabya> sorry for the naive questions. completely new to me
[17:56:15] <halfak> Dump generation code is beyond me. I'm not sure.
[17:56:41] <halfak> We should consume them from the http source. We'll likely want to assume they have been downloaded though, so maybe we'd give the dump processing script a directory to work from?
[17:57:30] <halfak> As for which project processes them, it seems that there are many that want to do a few different types of operations with the files. It's common to want to just sum up all of the views for a set of titles in a time period.
[17:57:37] <halfak> In our case, we want to aggregate up to calendar month.
[17:58:32] <halfak> As for processed data, we'll likely want to store that on a large disk on whatever machine we are doing the processing on. For now, it would be fine to use the "/srv" mounts on the ORES VMs in labs.
[17:59:04] <halfak> We'll want the utilities to operate as agnostically as possible with respect to all of these questions.
[17:59:22] <halfak> Maybe it would be best to just *try* to aggregate up to a month and talk about what is generalizable.
[18:04:41] <sabya> So, these views, how are they generated now? I mean, are they generated in an efficient way? Or is this the first time we are talking about generating them?
[18:11:55] <sabya> correction above: I mean, are they generated in an *inefficient* way currently?
[18:13:13] <halfak> sabya, well, the view dumps are generated however they are generated.
[18:13:25] <halfak> AFAIK, there's no monthly aggregation done for us.
[18:13:32] <halfak> So we're starting from scratch there.
[18:14:31] <sabya> ok
[19:38:22] <wikibugs> Revision-Scoring-As-A-Service, revscoring: Implement PCFG features - https://phabricator.wikimedia.org/T144636#2645808 (Halfak) I've cleaned up a lot. See https://github.com/halfak/pcfg, specifically [this commit](https://github.com/halfak/pcfg/commit/2bd80e3a86f3f441d11ad8edbc95f1522e394ec3). Here's a...
[19:57:53] <halfak> Alright folks, I'm calling it quits for the day.
[19:57:56] <halfak> Have a good one!
[19:58:05] <halfak> And have fun :)
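As a closing sketch of the month-level roll-up discussed above (aggregate to calendar month, write a portable compressed file first, load a DB later): the directory layout, the pagecounts-YYYYMMDD-HHMMSS.gz file-name convention used to infer the month, and the /srv output path are assumptions, and it keeps one project's monthly totals in memory, so a full-scale run would fall back to the streaming/Hadoop approach sketched earlier:

```python
# Sketch: roll hourly per-title counts up to calendar months and append
# them to a bz2-compressed TSV (compress as we write; DB loading comes later).
# Assumes hour files named like pagecounts-20160901-000000.gz under in_dir
# and lines of the form: project title count bytes
import bz2, gzip, os, re
from collections import defaultdict

HOUR_FILE = re.compile(r"pagecounts-(\d{4})(\d{2})\d{2}-\d{6}\.gz")

def monthly_totals(in_dir, project="en"):
    """Sum hourly counts for one project into (title, YYYY-MM) totals."""
    totals = defaultdict(int)
    for name in sorted(os.listdir(in_dir)):
        match = HOUR_FILE.search(name)
        if not match:
            continue
        month = "{}-{}".format(*match.groups())
        with gzip.open(os.path.join(in_dir, name), "rt", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 3 and parts[0] == project:
                    totals[(parts[1], month)] += int(parts[2])
    return totals

def write_tsv(totals, out_path="/srv/datasets/enwiki_monthly_views.tsv.bz2"):
    """Append rows of <title> <month> <total views> to a compressed TSV."""
    with bz2.open(out_path, "at") as out:
        for (title, month), views in sorted(totals.items()):
            out.write("{}\t{}\t{}\n".format(title, month, views))

# e.g. write_tsv(monthly_totals("/srv/dumps/pagecounts-all-sites/2016-09/"))
```

Mapping titles to page_ids (to match the (<page_id>, <month>, <total views>) shape mentioned above) would be a separate join step and is left out here.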