[15:10:32] \o/ Ironholds
[15:10:38] Just saw your email re. A/B test
[15:10:39] hey halfak :). How goes?
[15:10:47] cool! Was it useful/good/bad?
[15:11:00] * halfak kinda wants to see it on Meta
[15:11:02] :))))))
[15:12:40] I know! The next A/Bs will be :/
[15:13:08] the first one...one of the process problems I alluded to us being aware of is 'nobody told the analysts what we were testing, or when, or how, or anything, until it was deployed'
[15:13:49] You guys
[15:13:52] Sounds familiar
[15:13:53] We have no article on psychosocial work hazards
[15:14:11] Ironholds, can I copy-paste onto meta?
[15:14:30] * YuviPanda sets some memetic hazards on hare
[15:15:34] halfak, sure!
[15:15:48] hare, wikipedia is one long psychosocial work hazard case study
[15:15:57] It really is
[15:22:49] * halfak begins the copy and prepares for the pasting
[15:23:56] Ironholds, title proposal "Suggest it off: Reducing zero result searches"
[15:24:30] halfak, "Reducing Zero Result Searches Through Confidence Changes" would work too
[15:24:33] you should p r e s e n t a t w i k i c o n f e r e n c e u s a
[15:24:43] * halfak copies
[15:25:05] hare, I am o n t h e p r o g r a m m e c o m m i t t e e , I k n o w
[15:25:13] o m g w o r d s a r e h a r d t o r e a d l i k e t h i s
[15:25:22] I find them pretty easy, I just sit back a bit
[15:25:41] hare, but I can see me doing a presentation on things we learned about A/B testing generally
[15:25:46] if work agrees to pay the travel ;p
[15:25:47] W h a t h a s h a p p e n e d ? H o w d o I t u r n i t o f f?
[15:25:57] When is the conf?
[15:26:02] When is the deadline?
[15:26:11] When will there be a post on wiki-research-l?
[15:26:22] Deadline is August 31
[15:26:22] Can I sleep on someone's couch?
[15:26:35] Well, that deadline is going to be rough
[15:26:37] (Deadline is August 31 but it takes all of five minutes to submit a proposal)
[15:26:45] Oh.. maybe not.
[15:26:47] (Also it may not be a firm deadline)
[15:26:53] Could I propose to re-give a Wikimania talk
[15:26:55] ?
[15:26:56] Sure
[15:27:02] Especially since I think I missed most of yours
[15:27:22] Wikimania was horrible; filled with gastrointestinal illness and people crying
[15:27:41] I didn't see the crying. I saw a lot of GI illness though :(
[15:27:53] I was going to propose a talk at wcusa too
[15:28:01] but my new religious beliefs prevent that
[15:28:05] I saw the best minds of my generation destroyed by madness,
[15:28:05] ?
[15:28:34] my religion states that winter must be spent in san francisco, going to the office at regular intervals.
[15:29:51] good morning Ironholds YuviPanda halfak et al.
[15:30:03] o/ in b4 the et al.
[15:30:06] :D
[15:30:20] haha
[15:32:50] hey bearloga :D
[15:33:15] YuviPanda, october isn't winter
[15:33:18] december-feb is
[15:33:32] (the first day of spring is in march)
[15:34:45] October is the only summer SF ever has.
[15:34:59] I know, but why October?
[15:35:16] Air flow patterns. Basically, the East Bay sucks...
[15:35:26] ... all the cold air from over the ocean and brings it into SF
[15:35:33] lol
[15:35:36] Ironholds: true but my religion also says October to February must be spent in an apartment rented in one's own name
[15:35:36] Except in October
[15:35:43] * halfak is Hilarius
[15:35:50] I'm Innocent.
[15:36:17] and I'm Leo I!
[15:37:11] Pontification intensifies...
[15:37:13] Ironholds, hard not to fix typos in the first revision.
[15:37:23] halfak, do it, and submit a pull request ;p
[15:37:34] :P First version should be a clean copy-paste
[15:38:04] \o/ tag knows R
[15:38:44] Would you be willing to do the commons upload for me?
[15:38:48] Ironholds, ^
[15:38:56] It makes talking about the license easier
[15:39:06] https://meta.wikimedia.org/wiki/Research:Reducing_Zero_Result_Searches_Through_Confidence_Changes
[15:39:10] halfak, shoah
[15:39:36] halfak: yeah, I think that might've been from ori's (and legoktm's?) recent work changing the engine used for source highlighting
[15:40:00] I fully blame ori
[15:51:53] What would you all like to see in a WikiProject API?
[15:53:16] Ohai.
[17:18:41] * halfak hammers stat3 with NICE'd processes
[17:18:48] MWahahaha
[17:20:44] ggellerman__, can you invite me to the SoS?
[17:21:08] yep
[17:21:36] Are we under "Infrastructure"?
[17:21:51] Ahh yes. I see the template now
[17:24:36] * halfak wonders what is meant by "mild test"
[17:27:17] I'm running so many parallel processes on stat3 that they consume all of top.
[17:31:41] halfak: we're getting close now to using wp10 scores in production.
[17:31:51] \o/
[17:31:52] Wohooo
[17:32:05] I'm working on producing a dataset of ALL THE SCORES now. :)
[17:32:19] Oh?
[17:32:32] Yeah. I needed it for a related research project anyway.
[17:32:34] halfak: one of the things that I most want out of it is to also have the api provide some data about the *inputs* to the model. That is, whatever numbers for, like, citations, headers, and other tokens, that the model is using.
[17:32:57] A wikilabels campaign?
[17:33:03] ragesoss, Understood. You probably don't want the log-scaled values we actually use to make predictions though, right?
[17:33:10] You just want the raw counts
[17:33:15] halfak: probably.
[17:33:33] halfak: the first one that would be useful is citation count, just so that we don't have to parse the text ourselves to get it.
[17:33:54] Would a WP1.0 labels campaign make sense?
[17:34:01] ?
[17:34:05] For enwiki, no
[17:34:14] halfak: log-scaled values might be okay, if they make sense for human-understandable heuristics.
[17:34:26] You can re-scale them I suppose.
[17:35:15] Caching the features might be a little gnarly. We might enforce re-generating scores when asking for features
[17:35:21] halfak: the longer term idea would be to give suggestions based on the combination of those values and the score, to say things like 'you should add citations' or 'you should add headers' or 'you've got too many headers'.
[17:35:25] Halfak I mean over our assessments.
[17:35:38] We do not get feedback atm.
[17:36:01] I have some feedback...
[17:37:16] In my playing with the scores for lots of articles and revisions, I've found that the lowest scores do not come from completely blank articles, but instead from short pages with a standard structure, like redirects.
[17:37:40] ragesoss, sure. I think you want a machine-generated FA review etc. Right?
[17:37:46] We need a redirect class IMHO.
[17:37:57] A simple script rather
[17:38:02] In both cases (blank pages, and redirects) it'd be really good to have a 'not an article' class.
[17:38:16] If redirect, assess as such.
[17:38:34] it's nice to be able to predict that a given page would not even make it as a stub. Even if it's not necessarily a redirect, or a blank page.
[17:38:42] Blank pages have bigger issues.
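
To make the re-scaling exchange above concrete, here is a minimal sketch. The log does not show the actual transform, so the log(1 + x) form and all names here are assumptions for illustration, not the real revscoring API; the point is only that such a transform is invertible, so the raw counts ragesoss asks about can be recovered from the scaled feature values.

    import math

    # Assumed transform: log(1 + x). Monotonic and invertible, so either
    # representation (raw count or scaled feature) can be derived from the other.
    def log_scale(raw_count):
        return math.log(1 + raw_count)

    def unscale(scaled_value):
        return math.exp(scaled_value) - 1

    raw_citations = 42                   # a raw citation count
    scaled = log_scale(raw_citations)    # what the model would consume
    print(round(unscale(scaled)))        # back to 42
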
[17:39:12] It takes nothing to make a stub
[17:39:38] Well, I want something that goes smoothly from a single character to a single (say) template call, to a single sentence, to a full article.
[17:39:44] It takes more than nothing to make a stub.
[17:39:44] My seat cushion would pass as a stub. :-P
[17:40:06] If there is not even one sentence, it's not a stub.
[17:40:25] How do you define a sentence?
[17:40:42] What about an infobox?
[17:41:04] I would handle these independently
[17:41:13] White_Cat_mobile: if it has an infobox but no separate prose, on en-wiki it's going to get deleted as not even a stub.
[17:41:41] Sure, but as I said, adding that to the assessment scale would complicate matters
[17:42:05] but a full infobox is much closer to a stub than (for example) a sandbox where the only content is "{{user sandbox}}"
[17:42:14] It would introduce significant computational load with little gain.
[17:42:52] Fundamentally I do not disagree that these are problems, but the assessment scale is not the way to address them, in my humble opinion
[17:43:20] You can identify such articles with a simple regex
[17:43:25] one of the things I'm trying to do is give people the right advice when they need it. and one point where they need it is, 'this draft is probably ready to be a real article'.
[17:43:34] +1 on having a 'non-article' assessment class
[17:43:54] the assessment scale, as implemented in wp10, is essentially a metric of 'structural completeness'.
[17:44:07] I just do not see why we would train a machine learning classifier for this
[17:44:27] it doesn't understand the article, so it's never going to be a real quality metric. But it measures really nicely how closely an article's structure matches what we consider mature articles.
[17:44:56] How about this: instead of assessing the quality of the article, we assess its content. Does it have a sentence or not? If not, ignore it.
[17:45:28] It would introduce significant computational load with little gain.
[17:45:30] really?
[17:45:38] halfak: ^ (lotsa backscroll when you're back :P)
[17:46:01] White_Cat_mobile: we'll probably be doing those things, but only because (or only while) the wp10 scores behave badly for short pages.
[17:46:34] Each new classification would introduce exponential complexity for algorithms such as ECOC
[17:47:16] \o/ So, suggestbot already has this model. I've been working with Nettrom to figure out how to pull it into ORES.
[17:47:52] My plan is to use a small collection of Gaussian naive Bayes models
[17:48:12] Right now, Nettrom does this in a highly manual way. I want to just use a few scikit-learn models.
[17:48:35] I think it will work great, but I haven't tried anything like it.
[17:49:15] Worst case, we can replicate Morten's work and just not have a machine-learned strategy.
[18:00:02] I think it is important not to try to solve all problems with one model
[18:00:19] WP1.0 is great for quality assessment
[18:00:27] but beyond that it isn't as useful
[18:00:36] that other problem can have its own model
[18:00:51] it can take wp1.0 as a feature
[18:34:28] +1 ToAruShiroiNeko
[18:35:40] ragesoss, would you be happy to work with me on this type of model or would you rather have access to the raw features?
[18:37:03] halfak: either way. seems like the kind of thing that could be useful more broadly, so building on a common model for this kind of use case seems like a good approach.
[18:37:22] +1.
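
A minimal sketch of the "small collection of Gaussian naive Bayes models" plan halfak describes above, using scikit-learn as mentioned. The features, labels, and per-task framing are illustrative assumptions, not the actual suggestbot or ORES setup:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # One small binary model per suggestion task (e.g. "add citations"),
    # each trained over the same structural features.
    X = np.array([[0.0, 0.0],    # toy features: [citation count, header count]
                  [2.0, 1.0],
                  [15.0, 6.0],
                  [30.0, 9.0]])
    needs_citations = np.array([1, 1, 0, 0])  # toy labels for one task

    model = GaussianNB()
    model.fit(X, needs_citations)
    # Probability the suggestion applies to a new page:
    print(model.predict_proba(np.array([[1.0, 3.0]])))  # [[P(no), P(suggest)]]
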
I'm hoping to replace suggestbot's model first
[18:37:34] The "task predictor" part :)
[18:38:06] I do think that 'not-an-article' is something that would make sense as part of wp10.
[18:38:29] it's already something that the actual process of rating articles takes into account, and just gets excluded from the current version of the model.
[18:41:47] What would you use that for?
[18:52:24] halfak: one of the main things we want to do is detect when student work is ready for mainspace, so that we can give them instructions right when they need them.
[18:53:32] as I said, the main value I see in the wp10 scores is to indicate how structurally complete a Wikipedia page is (compared to our standards for a mature article)
[18:53:56] a model that knew how to account for globs of wikitext that aren't anything like a real article would help it do that better.
[18:54:12] "Ready for mainspace" is a bit different from what I was hoping to do with this model. I was hoping to have it note where obvious structural issues limit it from being an FA.
[18:54:32] "globs of wikitext that aren't anything like an article"?
[18:55:04] halfak: there's a continuum of potential, from stub all the way up to FA, in terms of using a model like this to provide useful direction to users trying to improve it.
[18:55:27] halfak: like, for example, a userpage.
[18:55:41] (a glob of wikitext that isn't trying to be an article)
[18:56:12] Gotcha. But are user pages standardized enough for a model to be useful?
[18:56:44] halfak: no. I'm not suggesting that you model user pages specifically (or that it would be useful to specifically identify user pages).
[18:57:11] Are you suggesting we have a model for "looks like an article of some quality" as opposed to "doesn't look like an article"?
[18:57:31] * halfak is missing something obvious
[18:57:32] I think
[18:58:06] halfak: let me put it differently.
[18:59:08] what I want the model to be is a continuous variable: how much like a perfect article is this page?
[18:59:24] Yeah. That.
[18:59:30] Assuming all FAs are perfect
[18:59:55] and I think it mainly falls short because it doesn't have a prediction for 'this is so far from perfect that it probably wouldn't even be rated a stub... it just wouldn't be an article at all'
[19:00:18] Oh. Yeah, the only thing that can be called "not even a stub" is a blank article
[19:00:45] Because if we start trying to figure out if a talk page is an article, we need better AI/NLP.
[19:00:46] halfak: well, you can also say that of (for example) articles that got deleted for certain CSD reasons.
[19:00:53] Or we build an is_article classifier
[19:01:04] ragesoss, still articles
[19:01:15] halfak: okay... but not stubs.
[19:01:22] something less than a stub.
[19:01:35] Yeah. Many deleted articles are more complete than stubs.
[19:01:38] E.g. pokemon
[19:02:12] I think the appropriateness of the material is different from quality.
[19:02:20] You can have a high quality inappropriate thing.
[19:02:55] I still think that we should feed a few blank/nearly blank pages into the model with a "blank" class.
[19:03:23] I was thinking in particular of CSD A3: no content.
[19:03:39] Yeah. That would work.
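
One common way to get the continuous "how much like a perfect article is this page?" variable ragesoss asks for above is to take the expectation over the ordered class probabilities. This is a sketch of that idea under stated assumptions, not necessarily how the wp10 model actually aggregates its output:

    # enwiki WP1.0 assessment classes in order of increasing completeness
    CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

    def weighted_quality(probabilities):
        """Collapse a {class: probability} score into a value in [0, 1]."""
        expected = sum(CLASSES.index(c) * p for c, p in probabilities.items())
        return expected / (len(CLASSES) - 1)

    # A page that is almost certainly a Stub scores near 0; a sure FA scores 1.
    print(weighted_quality({"Stub": 0.7, "Start": 0.2, "C": 0.1,
                            "B": 0.0, "GA": 0.0, "FA": 0.0}))  # 0.08

Under this scheme, a 'not-an-article'/'blank' class like the one discussed would sit below Stub in the ordering and pull such pages toward zero, instead of leaving them scored as low-confidence stubs.
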
[19:03:46] It's gonna have a honking big template on it though
[19:03:53] (which may in fact have content: "Any article (other than disambiguation pages, redirects, or soft redirects to Wikimedia sister projects) consisting only of external links, category tags and "See also" sections, a rephrasing of the title, attempts to correspond with the person or group named by its title, a question that should have been asked at the help or reference desks, chat-like comments, template tags, and/or images.")
[19:04:44] but that's something that I think the wp10 features would be appropriate to use to sort out "no content" wikitext from an article that would not get deleted as A3.
[19:05:09] OK. Can be done. All we need is a set of observations.
[19:07:18] The hard part about that is the sensitive data that is necessary to train the model
[19:07:27] assuming the articles have been deleted.
[19:07:32] that text is potentially sensitive.
[19:07:45] It would be great to have a convenient solution there.
[19:07:46] halfak: yeah. although with A3, probably mostly not.
[19:07:58] I agree.
[19:08:01] but it would still need to be vetted, I guess.
[19:08:08] Yeah.
[19:08:27] how big of a set would you need / want?
[19:08:28] I mean, I can run it myself, but using the hit-by-a-bus principle, others should be able to run it as well.
[19:10:26] halfak: yeah. what level of scrutiny would be required for screening out sensitive revisions? I mean, in many cases the deleted pages are present in dumps, right?
[19:11:07] Yes. This is a good question.
[19:11:19] The presence of deleted data in the dumps is something we officially ignore.
[19:11:24] halfak: alternatively, it could be revisions that got tagged as A3 but not deleted.
[19:11:38] But we're very careful about direct access to deleted stuff.
[19:11:43] on the assumption that it got saved via adding content, rather than incorrect tagging.
[19:11:49] ragesoss, maybe those tags are bad?
[19:12:01] halfak: maybe. but maybe the ratings are bad.
[19:12:12] how is that different from trusting the rating on a given revision?
[19:12:24] Indeed, but something tagged A3 and not deleted suggests the tag was wrong.
[19:12:51] halfak: so exclude cases where the tag got simply removed, with no other changes.
[19:13:04] Hmm... Sounds complicated
[19:13:41] That heuristic could be wrong, so we'd need to check.
[19:13:57] I'm not sure what value we get between "blank" and "stub".
[19:14:06] yeah. but it would let us gather a training set that we don't have to vet for privacy, just goodness.
[19:14:32] halfak: the value would be recognizing 'non-content' content.
[19:15:16] Would you be willing to go through a sample of drafts to flag them?
[19:15:23] halfak: yes.
[19:15:32] I bet that the AfC folk would be happy to have the support.
[19:15:51] And if you copy the text somewhere we can work with it, then we don't have the deleted content worries.
[19:16:07] Honestly, we could just release it as an open licensed dataset
[19:16:12] That would work for me.
[19:16:47] The work I'm doing on wikiclass involves an intermediate dataset that contains and