[17:19:30] halfak: could you ping me when you have a few minutes to chat about making an importance model available?
[17:22:55] yes. Will do. 40 mins or so.
[17:23:01] HaeB, I want to talk to you about https://meta.wikimedia.org/wiki/Research:New_page_reviewer_impact_analysis
[17:23:45] sure
[17:24:15] I understand you started pulling data.
[17:24:33] * halfak ignores his meeting about NPP right now ^_^
[17:24:43] halfak: awesome, I’ll make sure I’m available then
[17:28:44] halfak: yeah, a lot of people have pulled data actually ;) want to set up a call sometime?
[17:29:29] Yeah. I think so. I'm hoping to start formalizing some things today. Maybe I can just ping you here with ideas/questions? Do you have any reports/logs for me to check out?
[17:33:09] halfak: hmm, you have seen the report danny published, and the associated phabricator task(s), right?
[17:33:26] Report, not tasks. "Report" is kind of a mess.
[17:33:28] IIRC there was really just one other piece of data i helped out with
[17:33:33] gotcha
[17:33:45] Seems Toby and Danny think you are continuing analyses.
[17:34:08] maybe i should have been in that meeting
[17:34:59] yes, i'm pulling two more pieces of data (as mentioned on the talk page), but i'm nowhere near the main author of that report ;)
[17:35:04] * halfak invites.
[17:35:31] thanks, but i really need to prepare for another meeting right now
[17:35:39] gotcha
[17:35:49] appreciate it though
[18:09:06] o/ Nettrom
[18:09:09] I need to make lunch.
[18:09:11] o/ halfak
[18:09:14] go make lunch
[18:09:16] OK if I call from my kitchen?
[18:09:40] sure thing, let me move to a slightly more meeting-friendly space
[18:10:06] OK. Call me when ready.
[18:10:11] I'm stepping away from IRC :)
[19:28:15] o/ milimetric
[19:28:21] Did you work on some NPP analysis?
[19:29:34] milimetric provided a query at https://phabricator.wikimedia.org/T149021
[19:30:45] hola!
[19:30:45] * halfak needs to step out for another meeting but I'll be back soon and I'll read scrollback.
[19:30:57] HaeB is currently helping to give an overview of what he worked on.
[19:45:57] musikanimal, what did you work on for NPP?
[19:46:29] it's all in that gdoc. The big things I did were https://phabricator.wikimedia.org/P5480
[19:47:53] and https://quarry.wmflabs.org/query/19317 which is partly Jonathan's work
[19:50:06] that's all I did for data, but Ryan and I made a few patches to Page Curation, fixing bugs and re-introducing the top reviewers list https://en.wikipedia.org/w/api.php?action=pagetriagestats&topreviewers=last-month
[19:51:28] I think I might work on this next https://en.wikipedia.org/wiki/Wikipedia:Page_Curation/Suggested_improvements#6._Welcome_messaage
[20:00:02] OMG so many meetings
[20:00:34] musikanimal, thanks for the link to that paste. I'll work through that before I sit down to get working.
[20:00:54] FYI: this is what I'm going to be building: https://meta.wikimedia.org/wiki/Research:New_page_reviewer_impact_analysis
[20:01:06] The goal is to put together a study that could pass peer review.
[20:01:10] Coauthors welcome.
[20:01:27] gotcha, thanks for putting working so hard on this!
[20:01:35] *working so hard
[20:01:56] that paste I made is still not 100% accurate, but I tried to get as close to useful numbers as possible
[20:02:16] e.g. we don't want to count drive-by AWB editors who didn't actually work on the page in a reviewing capacity
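A minimal sketch of the kind of filter musikanimal describes above, assuming the enwiki replica schema of the time: when counting editors who actually worked on a page, drop semi-automated drive-by edits by their summaries. The table choice and the AWB summary pattern are illustrative assumptions; the actual logic lives in the P5480 paste.

    -- Count distinct editors per page, skipping edits whose summaries carry
    -- the AWB signature. The LIKE pattern is a crude assumption about AWB's
    -- default edit summary ("... using AWB"), not the exact filter in P5480.
    SELECT rev_page,
           COUNT(DISTINCT rev_user) AS substantive_editors
    FROM revision
    WHERE rev_comment NOT LIKE '%using%AWB%'
    GROUP BY rev_page;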
[20:39:46] milimetric: i was actually going to ask you something about the redirect issues outlined e.g. here: https://phabricator.wikimedia.org/T149021#3313291 ...
[20:41:22] milimetric: so i understand the mediawiki-history table (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history ) contains various fields that record the value of something both for when the event took place, and for today (i.e. snapshot time), e.g. page_namespace vs. page_namespace_latest
[20:42:05] milimetric: it would facilitate the NPP analysis a lot if we also had page_is_redirect aside from page_is_redirect_latest
[20:42:26] milimetric: any reason we don't? i'm going to file a ticket otherwise
[20:42:28] HaeB / halfak: sorry I'm off today taking care of my wife. I'll be back Thursday
[20:42:42] milimetric: oh ok!
[20:42:57] milimetric: all the best, see you on thursday
[20:43:18] HaeB: page redirect is impossible to get historically without parsing revision text, which we're planning on but it's a big job
[20:44:16] If we're doing a one-off analysis, getting historical redirect status isn't too difficult for non-deleted pages. For deleted pages... errr. It's bad.
[20:44:32] milimetric, do you know how to get it for deleted pages?
[20:45:15] yeah, we would need it for https://phabricator.wikimedia.org/T166269
[20:45:55] milimetric: ok thanks! i'll still file the task i guess, but won't expect an immediate fix then
[21:01:48] HaeB and musikanimal, thanks for your notes. I think that tomorrow I'll sit down with them, get caught up, and set up a work plan (for myself) for your review.
[21:01:51] More on that tomorrow.
[21:03:15] halfak: ok cool (and i think i sent you all the relevant links from my side by now)
[21:05:02] Right. Just finished reading through your messages in the other chat. I'll get back to you on efficiency once I've had some time to think through autoconfirmed and its inconveniences.
[21:07:40] halfak: i'll also send you the actual new query i'm using so far (a test version was at https://paws.wmflabs.org/paws/user/HaeB/notebooks/new%20pages%20scratchpad.ipynb but PAWS is still down... madhu is on it)
[21:34:13] i'll just post it here:
[21:35:06] https://www.irccloud.com/pastebin/PEx597no/English%20Wikipedia%20NPP%20backlog%20count%20by%20day%20and%20autoconfirmed%20status%20at%20time%20of%20creation
[21:35:13] halfak: ^
[21:36:14] got it. Thanks!
[21:36:51] halfak: i'm particularly interested in whether there are some indices i haven't made use of (i'm not very familiar with these tables)...
[21:37:46] ...i'm also wondering whether it's worth creating an intermediary table for "tenthedits" (containing the timestamp of every user's tenth edit)
[21:39:25] as it is, this takes about 2-4 hours on analytics-store for just ten days' worth of data, and we need six months (although it's probably not linear)
[21:41:10] HaeB, not using the archive table.
[21:41:23] A lot of edits that newcomers make tend to be to pages that are eventually deleted.
[21:41:25] D:
[21:41:59] yeah, i looked at some examples earlier
[22:06:13] halfak: i just got curious and tried to find the actual piece of mediawiki code that determines autoconfirmed status - the initial commit is here https://phabricator.wikimedia.org/rSVN19376#b2f0aae0 , but i'm not sure where to find the current version
[22:08:58] also, FWIW, the data lake queries (for article creation rates) crafted by milimetric and neilpquinn also use the number of surviving edits to determine the 10-edit threshold, see https://phabricator.wikimedia.org/T149021#3260176 (and modifications below)
[22:16:16] "surviving edits"?
[22:16:58] i.e. not yet deleted today (more precisely, at the time of the data lake snapshot)
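A minimal sketch of the "tenthedits" intermediary table HaeB floats above: each user's tenth-edit timestamp, from the revision table only. The staging-table name is an assumption, and since the MariaDB on analytics-store of the time had no window functions, this uses the user-variable ranking trick, whose evaluation order is not guaranteed; treat it as illustrative rather than as the actual query.

    -- Assumed name "staging.tenthedits": one row per user who has made at
    -- least ten live edits, with the timestamp of the tenth.
    CREATE TABLE staging.tenthedits AS
    SELECT user_id, tenth_edit_timestamp
    FROM (
        SELECT rev_user AS user_id,
               rev_timestamp AS tenth_edit_timestamp,
               -- rank each user's edits in chronological order via
               -- session variables (pre-window-function idiom)
               @rank := IF(@prev = rev_user, @rank + 1, 1) AS edit_rank,
               @prev := rev_user AS prev_user
        FROM revision,
             (SELECT @rank := 0, @prev := NULL) AS vars
        ORDER BY rev_user, rev_timestamp
    ) AS ranked
    WHERE edit_rank = 10;

Note this counts only surviving edits; per the archive-table discussion above and below, deleted edits would still be missing.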
[22:23:24] Oh! the data lake doesn't have archived revisions?
[22:23:58] it does
[22:24:20] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
[22:25:32] hmm... so why use the surviving edits?
[22:29:12] halfak: hm, i guess i misspoke... also, we should rather look at the latest version with corrections for that part: https://phabricator.wikimedia.org/T149021#3287887
[22:35:32] i think the problem there is rather that the count includes *non-surviving* edits at the time of the *event*. anyway, as mentioned, i was still going to look at that query more closely
[22:36:35] back to the backlog query (for which we can't easily use the data lake anyway, because the page triage tables are not available on hive):
[22:37:45] so, assuming mediawiki indeed counts undeleted edits only for $accountEditCount, what's the best way to use the archive table here?
[22:37:52] my first idea would be to modify the "ordered_data" subquery as follows:
[22:38:06] JOIN the existing query for the revision table with rows from archive with ar_deleted = true and (requires another JOIN with 'logging') page deletion timestamp > edit timestamp
[22:38:34] halfak: ^ does that make sense to you?
[22:38:56] still doesn't catch some nasty corner cases (revision deletion etc.)
[22:41:37] There's no way that MediaWiki excludes deleted edits in an edit count.
[22:41:50] That would be very computationally intensive every time a page is deleted.
[22:42:17] ok cool, that makes it easier... so i'll just replace 'revision' with 'archive', right
[22:43:10] Archive has different, but similarly named columns.
[22:43:17] And different indexes :(
[22:43:37] it's a bit surprising though... that means editors can't check autoconfirmed status based on public data, right?
[22:44:50] yes, will change rev_timestamp and rev_user to ar_timestamp and ar_user
[22:45:31] (rev_user_text was only there to make testing easier)
[23:07:07] HaeB, yeah. Only status, not history.
[23:07:24] HaeB, ar_user doesn't have an index in labs.
[23:07:30] Has an index on analytics slaves.
[23:43:15] i see - i'm not using labs anyway (because of the query limit, see above)
[23:43:30] halfak: speaking of indices, do you see a way to better make use of them in the query?
[23:45:19] Sorry, don't have time to look now. Will be reviewing everything tomorrow morning.
[23:45:28] Taking off now. Have a good one! o/
[23:45:42] halfak: ok, thanks for your input so far anyway!
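Where the thread lands: since MediaWiki's edit count does not exclude deleted edits, the backlog query can count rows from both revision and archive directly, without the logging join HaeB first floated. A minimal sketch of that combined per-user edit stream, assuming the enwiki schema of the time; how it slots into the backlog query's "ordered_data" subquery is omitted, and the revision-deletion corner cases mentioned at 22:38:56 are still not handled.

    -- Combine live and deleted edits into one per-user edit stream,
    -- matching MediaWiki's behavior of not excluding deleted edits.
    -- Column renames follow the archive schema (ar_user, ar_timestamp);
    -- note ar_user has no index on labs, only on the analytics slaves.
    SELECT rev_user AS user_id, rev_timestamp AS edit_timestamp
    FROM revision
    UNION ALL
    SELECT ar_user AS user_id, ar_timestamp AS edit_timestamp
    FROM archive
    ORDER BY user_id, edit_timestamp;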