[03:07:58] ha!
[06:02:38] Scoring-platform-team, articlequality-modeling, editquality-modeling, revscoring, and 2 others: Add English Language idioms to revscoring - https://phabricator.wikimedia.org/T205545 (HAKSOAT) Thanks for this
[09:15:08] Scoring-platform-team, articlequality-modeling, editquality-modeling, revscoring, and 2 others: Add English Language idioms to revscoring - https://phabricator.wikimedia.org/T205545 (HAKSOAT) @Halfak Please take a look at my Pull Request: https://github.com/wikimedia/revscoring/pull/466
[10:15:42] Hello halfak I made a PR this morning.
[15:55:28] o/ apergos
[15:55:35] hey ho
[15:55:52] I made some comments on the XML dump RFC. Was hoping to chat to make sure I understand things right.
[15:56:04] I'm looking at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/docs/export-0.11.xsd
[15:56:25] that's not yet what we use
[15:56:26] and https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps
[15:56:36] the plan is to go to that in a month (or so)
[15:56:54] Right. I'm working on making sure that my popular XML dump processing python stuff is ready :)
[15:57:24] It seems like the "sha1" field is missing from the new ContentTextType that appears inside of ContentType. <-- yep, it's been added by a recent patchset (or will be when that merges and is live, I forget)
[15:58:07] Aha! Cool.
[15:58:21] It appears as though every tag (ContentTextType) has a "deleted" field <-- I don't know how this plays out in practice, although there is no flag for content rows to be separately deleted,...
[15:58:54] or what the future might bring, so we have the attribute just in case
[15:59:02] Right. I can handle it either way. When I brought it up in #wikimedia-cpt, anomie told me that deletion is an all or nothing thing.
[15:59:15] right now it is, and so is oversight etc
[15:59:17] I'm cool with that. So I should expect it to be duplicated between all content.
[15:59:26] yes but don't bake that in
[15:59:27] "oversight"?
[15:59:35] if you get one that's deleted and another that isn't, roll with that
[15:59:45] oversighted/suppressed revisions
[15:59:49] or now content
[16:00:21] anything not visible to the public should not be dumped, and if the status changes mid-dump we can get some weird results
[16:01:40] Gotcha.
[16:02:59] Can we add DELETED_RESTRICTED to the schema while we're refactoring? <-- nope. this rfc was approved Oct 2018, so anything new we want must go into a new schema change proposal
[16:04:06] it's fine to start a discussion about fields you want at any time, but I'd suggest to wait until we actually switch over prod dumps to the 0.11 xsd first, just because that will be consuming all my schema change cycles for now
[16:04:23] Makes sense.
[16:05:40] That page still says it is a "draft". Should that be changed?
[16:06:07] as to backwards compat and slots etc, my s3kr1t plan for the next round is to just toss the wrapping text tag and make all the slots at the same level...
[16:06:17] nah, just leave it as is I'd say
[16:06:26] people can go read the task to see what happened
[16:07:34] so I was hoping that we'd roll out the new dumps by Feb 1 but if we slip it will be Mar 1
[16:08:34] there is the holiday recovery period and the all hands meeting both in this month so we'll see
[16:36:44] (sorry jumped into meetings)
[16:47:34] no worries, just leaving the info here :-)
[16:53:48] apergos, this movement towards having all of the slots at the same level is interesting to me. I'm not sure I understand though. Would we be dropping the main tag?
[16:54:13] FWIW, I think that would be viable though slightly complicated.
[16:54:38] well I haven't proposed this yet (see: schema change cycles all in use) but yeah I'd like all the slots to look alike in the output
[16:54:50] just that the main slot content would always be first
[16:55:26] it will mean people really truly updating their tools but that's what happens with anything like this
[16:55:46] by then they'll already have code in there to handle the slot/content structure for the rest
[16:58:08] For anyone who is using my XML processing tools, I can maintain backwards compatibility at that level. There, it is no big deal.
[16:59:27] 👍
[18:14:04] oh yeah almost forgot... RIP Python2.7
[18:14:34] halfak, we don't have anything that still targets py2.7 right?
[18:23:42] nope! \o/.
[18:33:02] right on
[21:44:01] Scoring-platform-team (Current), NewcomerTasks 1.1, drafttopic-modeling: Re-train English Wikipedia topic model using new WikiProject Taxonomy - https://phabricator.wikimedia.org/T240286 (Halfak) Regretfully, I was not able to get better signal from the new vectors. I'm not quite sure why. For this...
[22:10:54] revscoring 2.6.3 is released!
[22:11:02] Or the release is in progress.
[22:26:16] wikimedia/revscoring#1802 (gensim - 1925042 : halfak): The build passed. https://travis-ci.org/wikimedia/revscoring/builds/632471971
[23:34:11] Jade, Scoring-platform-team (Current), Documentation: Write Jade extension documentation - https://phabricator.wikimedia.org/T229968 (ACraze) I went through the extension docs today and made a few changes: - updated terminology to reflect the most recent schema and namespace changes (i.e. `NS_JUDG...